CN115759518A - Availability management system based on chaos engineering - Google Patents

Availability management system based on chaos engineering

Info

Publication number
CN115759518A
CN115759518A (application CN202211504121.0A)
Authority
CN
China
Prior art keywords
drilling
fault
scene
experiment
chaotic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211504121.0A
Other languages
Chinese (zh)
Inventor
郑晖
林耘毅
赵大宝
曹橹
惠晓靖
周镇威
金永春
陈宝清
杨熙琳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Merchants Bank Co Ltd
Original Assignee
China Merchants Bank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Merchants Bank Co Ltd filed Critical China Merchants Bank Co Ltd
Priority to CN202211504121.0A priority Critical patent/CN115759518A/en
Publication of CN115759518A publication Critical patent/CN115759518A/en
Pending legal-status Critical Current

Landscapes

  • Debugging And Monitoring (AREA)

Abstract

The invention relates to the technical field of information system testing and discloses an availability management system based on chaos engineering. When a drilling command from a user is received, the object management module parses the set of drilling objects under that user's name and a drilling object is selected from the set; fault atoms are screened from the fault atom library according to the drilling purpose of the command, and a drilling scene is created from those fault atoms; a fault drilling is created in the experiment management module from the drilling object and the drilling scene, the drilling scene is orchestrated and scheduled, and the fault drilling is executed in the scheduled drilling scene; the defect tracking module monitors the whole drilling process and tracks the problems found during the drilling. The fault tolerance of the business system in a complex environment, and its immunity to unknown risks, are thereby improved.

Description

Availability management system based on chaos engineering
Technical Field
The invention relates to the technical field of information system testing, and in particular to an availability management system based on chaos engineering.
Background
Chaos engineering is the discipline of experimenting on distributed systems, and aims to build the capability of, and confidence in, a system's ability to withstand out-of-control conditions in the production environment. Fault drilling is a concrete practice of chaos engineering: real faults are simulated and the steady state of the business is observed in order to verify the availability and disaster tolerance of the system.
At present, common chaos engineering platforms include the Chaos Monkey automation platform and the open-source chaos engineering tool ChaosBlade, which are simple to operate, non-intrusive and highly extensible. However, these platforms do not cover real fault-simulation scenarios completely and lack full fault orchestration capability; in addition, their defense mechanisms are weak, the blast radius is difficult to control, and other services and resources are easily affected.
Disclosure of Invention
The main object of the invention is to provide an availability management system based on chaos engineering, with the aim of improving the fault tolerance of a business system in a complex environment and its immunity to unknown risks.
In order to achieve the above object, the present invention provides a chaos experiment method applied to an availability management system of chaos engineering, wherein the availability management system comprises an object management module, a fault atom library, an experiment management module and a defect tracking module, and the chaos experiment method comprises:
when a drilling command of a user is received, analyzing a drilling object set under the user name through the object management module, and selecting a drilling object from the drilling object set; wherein the set of drill objects comprises a base object and a cluster object;
screening out fault atoms from the fault atom library based on the drilling purpose of the drilling command, and creating a drilling scene according to the fault atoms;
according to the drilling object and the drilling scene, fault drilling is established in the experiment management module, the drilling scene is scheduled to obtain a scheduled drilling scene, and the fault drilling is executed in the scheduled drilling scene;
and monitoring the whole drilling process of the fault drilling through the defect tracking module, and performing problem tracking in the drilling process of the fault drilling.
Preferably, the availability management system of chaos engineering further includes a registry module, and after the step of analyzing the drilling object set under the user name through the object management module and selecting a drilling object from the drilling object set when a drilling command from a user is received, the method further includes:
monitoring, through the registry module, whether the heartbeat of each drilling object is normal, and, in the case of a heartbeat abnormality, removing the drilling objects with abnormal heartbeats so as to determine a target drilling object;
the step of creating a fault drill in the experiment management module according to the drill object and the drill scene comprises:
and creating fault drilling in the experiment management module according to the target drilling object and the drilling scene.
Preferably, before the step of analyzing the drilling object set under the user name through the object management module and selecting a drilling object from the drilling object set when receiving a drilling command from a user, the method further includes:
acquiring historical real fault data of a basic resource layer, a platform resource layer, an application resource layer and other middleware in the availability management system of the chaotic engineering;
sorting the historical real fault data according to a preset rule, and extracting target real event data from the sorted historical real fault data;
and extracting fault atoms according to the target real event data, and constructing a corresponding fault atom library based on the fault atoms so as to customize a corresponding drilling scene from the fault atom library in the subsequent chaotic experiment.
Preferably, the step of creating a fault drilling in the experiment management module according to the drilling object and the drilling scene, orchestrating and scheduling the drilling scene to obtain a scheduled drilling scene, and executing the fault drilling in the scheduled drilling scene includes:
according to the drilling object and the drilling scene, establishing fault drilling and generating an experiment list in the experiment management module, and sending an operation request to an API layer;
after the API layer receives the operation request, starting an experiment, and writing the fault drilling and the experiment list into a key value storage; the key value storage stores relevant information in the fault drilling execution process;
polling related information in the key value storage through a scheduling instance scheduler, adding the drilling objects in the fault scene in a serial or parallel mode for scheduling to obtain a scheduled drilling scene, and storing the scheduled drilling scene into the key value storage;
polling the timing information of the fault drilling in the key value storage through a task instance scheduler, starting an experiment through the API layer according to the timing information, and executing the fault drilling in the scheduled fault scene.
Preferably, after the step of polling, by the task instance scheduler, the timing information of the fault drilling in the key value storage, starting an experiment through the API layer according to the timing information, and executing the fault drilling in the scheduled fault scene, the method further includes:
performing blast radius control on the fault drilling based on a preset steady-state index and a threshold corresponding to the steady-state index.
Preferably, the creating a fault drilling in the experiment management module according to the drilling object and the drilling scene, and performing scheduling on the drilling scene to obtain a scheduled drilling scene, and after the step of performing the fault drilling in the scheduled drilling scene, the method further includes:
and when sudden faults occur in the execution process of the fault drilling, the fault drilling is stopped.
Preferably, the availability management system of chaos engineering further includes a platform display module, and after the step of monitoring the execution state of the fault drilling through the defect tracking module and performing problem tracking during the execution of the fault drilling, the method further includes:
displaying the experiment result data of the fault drilling through the platform display module; the experiment result data include experiment completion indexes, experiment quantity distribution and ranking, and problem analysis.
Preferably, the chaos experiment method further comprises:
when a modification instruction based on bottom-layer tool extension is received, modifying the LitmusChaos code according to the modification instruction, so that the drilling scenes in the availability management system of chaos engineering are adapted to the enterprise-level OpenShift container platform and the system supports OCP cluster fault scenes.
Preferably, the availability management system of chaos engineering further includes a log system, and after the step of creating a fault drilling in the experiment management module according to the drilling object and the drilling scene, orchestrating and scheduling the drilling scene to obtain a scheduled drilling scene, and executing the fault drilling in the scheduled drilling scene, the method further includes:
in the case that the fault drilling is executed successfully, obtaining a fault drilling result, raising early warnings according to the fault drilling result, and storing the fault drilling result in the log system.
In addition, in order to achieve the above object, the present invention further provides an availability management system of chaos engineering, including an object management module, a fault atom library, an experiment management module and a defect tracking module, wherein:
the object management module is used for managing the set of drilling objects required by the chaos experiment, the set comprising base objects and cluster objects, and is also used for selecting a drilling object from the set when a drilling command from a user is received;
the fault atom library is used for storing the fault atoms required by the chaos experiment, the fault atoms covering at least one of a base resource layer, a platform resource layer, an application resource layer and other middleware, and is also used for screening out the fault atoms corresponding to the drilling purpose of the drilling command and creating a drilling scene from those fault atoms;
the experiment management module is used for creating fault drilling according to the drilling object and the drilling scene, scheduling the drilling scene to obtain a scheduled drilling scene, and executing the fault drilling in the scheduled drilling scene;
and the defect tracking module is used for monitoring the whole drilling process of the fault drilling and carrying out problem tracking in the drilling process of the fault drilling.
In addition, to achieve the above object, the present invention also provides a chaos experimental apparatus, comprising:
the selection module is used for analyzing a drilling object set under the user name through an object management module when a drilling command of a user is received, and selecting a drilling object from the drilling object set; wherein the set of drilling objects comprises a base object and a cluster object;
the screening module is used for screening out fault atoms from the fault atom library module based on the drilling purpose of the drilling command and creating a drilling scene according to the fault atoms;
the creating module is used for creating fault drilling in the experiment management module according to the drilling object and the drilling scene, performing arrangement and scheduling on the drilling scene to obtain an arranged drilling scene, and executing the fault drilling in the arranged drilling scene;
and the monitoring module is used for monitoring the execution state of the fault drilling through the defect tracking module and tracking the problems in the execution process of the fault drilling.
Further, to achieve the above object, the present invention also provides a computer device, comprising: a memory, a processor, and a chaos experiment program stored on the memory and executable on the processor, wherein the chaos experiment program, when executed by the processor, implements the steps of the chaos experiment method described above.
In addition, to achieve the above object, the present invention also provides a computer readable storage medium having a chaos experiment program stored thereon, where the chaos experiment program, when executed by a processor, implements the steps of the chaos experiment method as described above.
The invention provides a chaos experiment method and an availability management system of chaos engineering. The chaos experiment method is applied to the availability management system, which comprises an object management module, a fault atom library, an experiment management module and a defect tracking module. When a drilling command from a user is received, the object management module parses the set of drilling objects under that user's name and a drilling object is selected from the set; fault atoms are screened from the fault atom library according to the drilling purpose of the command, and a drilling scene is created from those fault atoms; a fault drilling is created in the experiment management module from the drilling object and the drilling scene, the drilling scene is orchestrated and scheduled, and the fault drilling is executed in the scheduled drilling scene; the defect tracking module monitors the whole drilling process and tracks the problems found during the drilling. The fault tolerance of the business system in a complex environment, and its immunity to unknown risks, are thereby improved.
Drawings
FIG. 1 is a schematic diagram of the device structure of a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a first embodiment of the chaos experiment method of the present invention;
FIG. 3 is a schematic diagram of the data architecture of the availability management system of chaos engineering of the present invention;
FIG. 4 is a schematic diagram of an application architecture of the availability management system of chaos engineering of the present invention;
FIG. 5 is a schematic diagram of a system architecture of the availability management system of chaos engineering of the present invention;
FIG. 6 is a schematic flow chart of a second embodiment of the chaos experiment method of the present invention;
FIG. 7 is a schematic diagram of the technology stack structure of the availability management system of chaos engineering of the present invention;
FIG. 8 is a schematic flow chart of a third embodiment of the chaos experiment method of the present invention;
FIG. 9 is a schematic flow chart of a fourth embodiment of the chaos experiment method of the present invention;
FIG. 10 is a schematic diagram of the distributed architecture of the orchestration and scheduling functions of the chaos experiment method of the present invention;
FIG. 11 is a schematic diagram of the process of executing a fault drilling in the chaos experiment method of the present invention;
FIG. 12 is a schematic flow chart of a chaos experiment executed in the availability management system of chaos engineering after the LitmusChaos code is modified, according to the chaos experiment method of the present invention;
FIG. 13 is a schematic diagram illustrating the effect of cached data in the Redis database in the chaos experiment method of the present invention;
FIG. 14 is a functional block diagram of a first embodiment of the chaos experiment apparatus of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in fig. 1, fig. 1 is a schematic device structure diagram of a hardware operating environment according to an embodiment of the present invention.
The computer device of the embodiment of the invention may be a mobile terminal or a server device.
As shown in fig. 1, the device may include: a processor 1001 such as a CPU, a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002. The communication bus 1002 is used to implement connection and communication among these components. The user interface 1003 may include a display (Display) and an input unit such as a keyboard (Keyboard), and optionally may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory such as a disk memory. The memory 1005 may alternatively be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the configuration of the apparatus shown in fig. 1 is not intended to be limiting of the apparatus and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
As shown in fig. 1, a memory 1005, which is a kind of computer storage medium, may include therein an operating system, a network communication module, a user interface module, and a chaos experiment program.
The operating system is a program for managing and controlling the terminal device and software resources, and supports the operation of the network communication module, the user interface module, the chaos experiment program and other programs or software; the network communication module is used to manage and control the network interface 1004; the user interface module is used to manage and control the user interface 1003.
In the computer apparatus shown in fig. 1, the computer apparatus calls the chaotic experiment program stored in the memory 1005 through the processor 1001 and performs the operations in the respective embodiments of the chaotic experiment method described below.
Based on the hardware structure, the embodiment of the chaotic experimental method is provided.
Referring to fig. 2, fig. 2 is a schematic flow chart of a first embodiment of the chaotic experimental method according to the present invention, and the chaotic experimental method includes:
step S10, when a drilling command of a user is received, analyzing a drilling object set under the user name through the object management module, and selecting a drilling object from the drilling object set; wherein the set of drilling objects comprises a base object and a cluster object.
Fault drilling is a practice that follows the principles of chaos engineering: various possible production faults and abnormal states are simulated on drilling objects such as physical machines and computer clusters so that the behaviour of the drilling objects can be observed and corresponding design and optimization performed, thereby improving the performance and fault tolerance of the drilling objects and avoiding the truly catastrophic consequences of an unexpected emergency. A fault drilling process generally includes selecting a drilling object, determining a drilling scene (fault injection), serial/parallel orchestration and scheduling, and monitoring, repairing, testing and verifying the drilling scene.
The chaos experiment method is applied to an availability management system of chaos engineering. The availability management system adopts a five-layer architecture; referring to fig. 3, fig. 3 is a schematic diagram of the data architecture of the availability management system. The data architecture includes:
display layer: the user terminal accesses the availability management system through HTTP/HTTPS requests, giving a streamlined page display;
interface layer: the operation of the availability management system is implemented through the related interfaces (such as HTTP, HTTPS, WebSocket and an API gateway);
service layer: the service layer comprises all functional modules of the availability management system (such as the fault injection, API, experiment orchestration, log and alarm modules);
support layer: rich fault scenes (such as Linux fault scenes, CPU fault scenes and K8s fault scenes) are provided for the fault drilling of the availability management system;
resource layer: data storage and caching are implemented with MySQL, Elasticsearch, Redis and the like.
Referring to fig. 4, fig. 4 is a schematic diagram of the technical architecture of the availability management system of chaos engineering.
The availability management system supports breaking the system on demand to carry out chaos experiments, and supports users in manually orchestrating fault scenes. During the execution of a scene task, a one-key stop of the drilling experiment is supported for emergency situations, so the drilling process can be kept under control. The design concepts of object orientation, modularity and plug-ins are combined with a B/S architecture and a centralized plus distributed deployment mode, which facilitates platform extension and integration and provides optimal technical platform support for the services.
The availability management system of chaos engineering mainly comprises an object management module, a fault atom library, an experiment management module and a defect tracking module; referring to fig. 5, fig. 5 is a schematic diagram of the application architecture of the availability management system.
The object management module is used for authenticating a user's permissions after the user logs in to the availability management system, and for managing the set of drilling objects under that user's name; the drilling object set comprises base objects and cluster objects. A base object may include virtual machines, physical machines and other resources; a cluster object is an object with cluster properties, and may include an OCP cluster, a K8S container cluster, and the like.
The object management module is also used for identifying which drilling objects each fault drilling involves when the user creates a fault drilling. A drilling object is a device or product on which fault drilling needs to be performed.
The fault atom library is used for screening fault atoms according to the user's drilling purpose when the user creates a fault drilling; the fault atom library covers the base resource layer, the platform resource layer and the application resource layer. A fault drilling contains a plurality of fault atoms and is a permutation and combination of those fault atoms.
The experiment management module is used for creating experiments and the experiment list. Creating an experiment provides a platform on which users create their own experiments, covering a series of operations such as the experiment name, the steady-state hypothesis, the experiment objects, the experiment environment and fault orchestration. The experiment list lists all the drillings created by all users in the current availability management system, including basic information such as the drilling state, whether the drilling has started, when it is executed, its executor and the result information; the experiment list also provides some basic later operations, for example modifying certain details of a fault drilling.
The defect tracking module is used so that users can discover and continuously track hidden risks during daily fault drilling, feed them back and resolve them, and perform secondary drilling against these risks at a later stage.
When a drilling command from a user is detected in the availability management system, for example a command to perform a fault drilling on a certain drilling object under the user's name, the object management module parses the user's set of drilling objects in the system, and a corresponding drilling object is selected from the base objects and cluster objects of the set according to the drilling purpose of the command.
Step S20, screening out fault atoms from the fault atom library based on the drilling purpose of the drilling command, and creating a drilling scene according to the fault atoms.
The fault atom library distils the risks encountered in the production environment so that they can be matched against and used in daily fault drilling. A drilling scene may include multiple fault atoms, that is, the fault scene is a permutation and combination of multiple fault atoms.
Fault atoms matching the drilling purpose of the drilling command are screened out of the fault atom library, and a drilling scene can then be created from those fault atoms.
For example, when a hardware fault drilling needs to be performed on a drilling object, the drilling scene to be constructed is a corresponding base-resource-layer fault scene.
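As a rough illustration of this screening step, the following minimal sketch (in Python; the class, field and tag names are hypothetical and not part of the invention) filters a fault atom library by tags describing the drilling purpose and combines the matching atoms into a drilling scene:

from dataclasses import dataclass

@dataclass
class FaultAtom:
    # Hypothetical schema; the invention does not define exact fields.
    name: str            # e.g. "cpu-high-load"
    layer: str           # "base" | "platform" | "application" | "middleware"
    tags: frozenset = frozenset()

class FaultAtomLibrary:
    def __init__(self, atoms):
        self.atoms = list(atoms)

    def screen(self, purpose_tags):
        # Keep the atoms whose tags intersect the drilling purpose.
        return [a for a in self.atoms if a.tags & frozenset(purpose_tags)]

def create_drilling_scene(atoms):
    # A drilling scene is a permutation/combination of the screened fault atoms.
    return {"atoms": [a.name for a in atoms]}

library = FaultAtomLibrary([
    FaultAtom("cpu-high-load", "base", frozenset({"hardware", "cpu"})),
    FaultAtom("network-delay", "base", frozenset({"network"})),
    FaultAtom("pod-delete", "platform", frozenset({"container", "k8s"})),
])
scene = create_drilling_scene(library.screen({"hardware"}))   # -> cpu-high-load only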
Further, the drilling scene comprises one or a combination of several of a base resource layer fault scene, a platform resource layer fault scene, an application resource layer fault scene and other middleware fault scenes.
The fault atom library covers fault scenes such as a basic resource layer, a platform resource layer, an application resource layer and other middleware.
(I) The fault scenes of the base resource layer mainly comprise fault scenes of basic hardware, basic networks, basic resources and the like.
The base resource layer failure scenario may contain the following:
(1) Host hardware fault scenes: fault injection that raises the load on the CPU, memory and disk, so that the system is exercised under resource shortage and high load; this verifies the stability of the system, the effectiveness of monitoring and alarming and the completeness of indicators, serves as a surprise fault drill, and improves the ability of the relevant personnel to locate and handle faults. (A minimal sketch of such CPU load injection follows this list.)
(2) Host network fault scenes: fault injection for network delay, packet loss, packet duplication, packet corruption, packet disorder, DNS resolution errors and port occupation, so that the disaster recovery red line, the effectiveness of monitoring and alarming and the completeness of indicators can be verified under abnormal network conditions; this also serves as a surprise fault drill and improves the ability of the relevant personnel to locate and handle faults.
(3) Host process state fault scenes: suspension and kill fault injection by process id or process keyword, to verify the self-healing and fault tolerance of the system when one of its processes is abnormal.
(4) Host script fault scenes: injecting execution-delay or execution-exit faults into shell script functions, to verify the automatic execution capability of the scripts.
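As a rough illustration of the CPU load fault atom mentioned in item (1), the following minimal Python sketch raises CPU load by spawning busy-loop processes; it only illustrates the idea and is not the platform's fault injection implementation:

import multiprocessing, time

def _burn_cpu(stop_at):
    # Busy-loop until the deadline, keeping one core close to 100% utilisation.
    while time.time() < stop_at:
        pass

def inject_cpu_load(cores=2, duration_s=60):
    # Drive `cores` CPU cores to high load for `duration_s` seconds.
    stop_at = time.time() + duration_s
    workers = [multiprocessing.Process(target=_burn_cpu, args=(stop_at,))
               for _ in range(cores)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

if __name__ == "__main__":
    inject_cpu_load(cores=2, duration_s=30)   # raise the load, then let it subside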
(II) The platform resource layer fault scenes mainly comprise fault scenes of containerized components and non-containerized components.
The components in the platform resource failure scenario may include the following:
(1) Containerized components: the platform component is stopped through pod-delete or container-delete, and the behaviour of the platform component in the abnormal state is verified. In fault injection experiments, relevant monitoring and alarm rules can be formulated and refined based on the different impacts that different component anomalies have on the system, helping users react quickly when a core platform component becomes abnormal.
(2) Non-containerized components (such as Docker and kubelet): fault injection cannot be achieved for these components through the container scheme, so the platform provides the relevant fault injection capability for them and can inject faults into the non-containerized components related to the platform.
The failure scenario for a non-containerized component may include the following:
(1) Docker container fault scenes: corresponding fault injection on container resources, the container network and the container state, verifying container fault tolerance and stability, the container disaster recovery red line, the effectiveness of monitoring and alarming and the completeness of indicators, serving as a surprise fault drill and improving the ability of the relevant personnel to locate and handle faults.
(2) OCP cluster fault scenes: corresponding fault injection on node resources, pod resources, service resources and container resources, so that problems in the business system are found in time; this verifies container fault tolerance and stability, the container disaster recovery red line, the effectiveness of monitoring and alarming and the completeness of indicators, serves as a surprise fault drill, and improves the ability of the relevant personnel to locate and handle faults.
(3) K8S container cluster fault scenes: corresponding fault injection on node resources, the node network, the node state, pod resources, the pod network, container resources, the container network and the container state, verifying container cluster fault tolerance and stability, the container disaster recovery red line, the effectiveness of monitoring and alarming and the completeness of indicators, serving as a surprise fault drill and improving the ability of the relevant personnel to locate and handle faults. (A pod-delete sketch follows this list.)
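As a rough illustration of the pod-delete style fault injection mentioned above, the following minimal sketch uses the official Kubernetes Python client to delete one pod matching a label selector; the namespace and labels are made-up examples, and this is not the platform's own injection code:

import random
from kubernetes import client, config   # official Kubernetes Python client

def delete_random_pod(namespace, label_selector):
    # Kill one matching pod to emulate a pod-delete fault atom.
    config.load_kube_config()            # or config.load_incluster_config()
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(namespace, label_selector=label_selector).items
    if not pods:
        return None
    victim = random.choice(pods).metadata.name
    v1.delete_namespaced_pod(name=victim, namespace=namespace)
    return victim

# Example (namespace and labels are made up): delete_random_pod("demo", "app=payment")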
(III) The application resource layer fault scenes mainly comprise fault scenes of application components and the like.
The application resource failure scenario may include the following:
(1) JVM fault scenes: by specifying classes and methods and injecting delays, modified return values and exceptions, the stability of the application, the effectiveness of monitoring and alarming and the completeness of indicators are verified, a surprise fault drill is carried out, and the ability of the relevant personnel to locate and handle faults is improved.
(2) Servlet fault scenes (the Servlet is a Java web interface specification): fault injection by request count, request percentage, HTTP request type, Java process id, Java process name and the like, so that problems in the application are found in time; the effectiveness of monitoring and alarming and the completeness of indicators are verified, a surprise fault drill is carried out, and the ability of the relevant personnel to locate and handle faults is improved.
Step S30, creating a fault drilling in the experiment management module according to the drilling object and the drilling scene, orchestrating and scheduling the drilling scene to obtain a scheduled drilling scene, and executing the fault drilling in the scheduled drilling scene.
A fault drilling is created through the experiment management module on the basis of the drilling object and the drilling scene; the process of creating the fault drilling specifically includes a series of operations such as the experiment name, the steady-state hypothesis, the experiment objects, the experiment environment and fault orchestration.
The drilling object is added to the fault scene, and the fault scene is orchestrated and scheduled in a serial or parallel mode to obtain the scheduled scene. The fault drilling can then be executed automatically in the scheduled fault scene, where automatic execution can include timed execution, untimed execution, immediate execution and resetting the fault drilling. Here, the fault drilling refers to the combination of a drilling object and a plurality of fault atoms (the drilling scene).
Step S40, monitoring the whole drilling process of the fault drilling through the defect tracking module, and tracking problems during the fault drilling.
The defect tracking module monitors the whole drilling process of the fault drilling, detects whether any abnormality occurs during the drilling, and can observe the execution state of the drilling object, its performance capacity, its instance state and so on. When an abnormality occurs during the fault drilling, it is fed back (to the cooperating developers or framework centre staff) and resolved, and a secondary drilling is performed against it at a later stage.
For example, if monitoring finds an abnormality in an application or machine during a fault drilling, the cooperating developers or framework centre are asked to confirm whether the abnormality is within or outside expectations, the solution, the resolution period, the risk level, and whether the risk also applies to other environments.
Further, after step S40, the method further includes: displaying the experiment result data of the fault drilling through the platform display module; the experiment result data include experiment completion indexes, experiment quantity distribution and ranking, and problem analysis.
The availability management system of chaos engineering further comprises a platform display module, which provides report statistics and reflects the operating state of the system. The content displayed by the platform display module mainly comprises experiment completion indexes, experiment quantity distribution and ranking, and problem analysis, specifically:
(1) Annual/current-month problem data statistics: showing the total number of problems, unknown risks and resolved problems; showing the number of fault injections in the whole year (month); and showing the number of experiments, the number of problems found and the number of unknown risks.
(2) Number of experiment scenes: base resources, platform resources, application resources.
(3) Experiment conditions: a graph visually showing the total number of experiments over the past week, yesterday and today.
(4) Statistics by experiment category: a pie chart showing the number of experiments executed in different experiment scopes for the various resource scenes, including base resources, platform resources and application resources.
(5) Historical blast data statistics: a histogram of historical (last half year) blast data, including the historical number of blasts, number of experiments, number of problems found, number of unknown risks and number of problems resolved.
(6) Experiment result list: the experiment result data presented in tabular form.
(7) Problem list: the problem data presented in list form.
In this embodiment, when a drilling command from a user is received, the set of drilling objects under the user's name is parsed through the object management module and a drilling object is selected from the set; fault atoms are screened out of the fault atom library based on the drilling purpose of the drilling command, and a drilling scene is created from those fault atoms; a fault drilling is created in the experiment management module according to the drilling object and the drilling scene, the drilling scene is orchestrated and scheduled, and the fault drilling is executed in the scheduled drilling scene; the defect tracking module monitors the execution state of the fault drilling and tracks problems during its execution. The fault tolerance of the business system in a complex environment, and its immunity to unknown risks, are thereby improved.
Based on the first embodiment, a second embodiment of the chaos experiment method of the invention is provided. Referring to fig. 6, in this embodiment, after step S10, the method may further include:
step A10, monitoring whether the heartbeat of the drilling object is normal through the registration center module, and under the condition that the heartbeat of the drilling object is abnormal, removing the drilling object with the abnormal heartbeat of the drilling object to determine a target drilling object.
Referring to fig. 7, fig. 7 is a schematic diagram of the technology stack structure of the availability management system of chaos engineering. The availability management system further comprises a registry module, which is used for monitoring whether the drilling objects under the user's name in the object management module are normal, so that drilling abnormalities caused by an abnormal heartbeat of a drilling object are eliminated before the fault drilling is executed.
Before the fault drilling is executed, the registry module monitors whether the heartbeat of each drilling object is normal; when heartbeat abnormalities exist, the drilling objects with abnormal heartbeats are removed, the drilling objects with normal heartbeats are retained, and those objects are determined as the target drilling objects. (A minimal sketch of this filtering follows.)
In addition, during the fault drilling performed on the target drilling object, the registry module can also monitor whether the heartbeat of the target drilling object becomes abnormal, ensuring that the instances selected before the drilling are indeed the ones being drilled.
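A minimal Python sketch of this heartbeat filtering (the timeout value and object names are assumptions, since the text does not specify them):

import time

HEARTBEAT_TIMEOUT_S = 30   # assumed threshold; the text does not give a value

def select_target_objects(drilling_objects, last_heartbeats):
    # Drop drilling objects whose last heartbeat is missing or stale.
    now = time.time()
    targets = []
    for obj in drilling_objects:
        beat = last_heartbeats.get(obj)
        if beat is not None and now - beat <= HEARTBEAT_TIMEOUT_S:
            targets.append(obj)
    return targets

beats = {"machine-A": time.time(), "machine-B": time.time() - 120}
print(select_target_objects(["machine-A", "machine-B", "machine-C"], beats))
# -> ['machine-A']: machine-B is stale, machine-C never registered a heartbeat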
Step A20, the step of creating a fault drill in the experiment management module according to the drill object and the drill scene comprises:
and creating fault drilling in the experiment management module according to the target drilling object and the drilling scene.
After the target drilling object is determined through the registry module, fault drilling can be initiated according to the target drilling object and the fault scene; further, a corresponding fault drill is created in the experiment management module according to the target drill object and the fault scene; the process of creating the fault drill may specifically include a series of operations such as an experiment name, a steady-state hypothesis, an experiment object, an experiment environment, and fault orchestration.
For example, suppose a fault scene includes CPU, disk and network faults, and the target drilling objects are machine A, machine B and machine C.
After the fault drilling and the experiment list are created, the fault drilling tasks may be executed serially, first on machine A, then on machine B and finally on machine C; or the fault drilling tasks of machine A, machine B and machine C may be executed concurrently at the same time. (A minimal sketch of both modes follows.)
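A minimal Python sketch of the two execution modes (the machine names and fault atoms are the illustrative ones above; run_drilling is a placeholder, not the platform's executor):

from concurrent.futures import ThreadPoolExecutor

def run_drilling(machine, scene):
    # Placeholder for injecting the scene's fault atoms on one machine.
    return f"{machine}: executed {scene}"

def execute_serial(machines, scene):
    # Machine A, then machine B, then machine C.
    return [run_drilling(m, scene) for m in machines]

def execute_parallel(machines, scene):
    # All machines drilled concurrently.
    with ThreadPoolExecutor(max_workers=len(machines)) as pool:
        return list(pool.map(lambda m: run_drilling(m, scene), machines))

scene = ["cpu-fault", "disk-fault", "network-fault"]
machines = ["machine-A", "machine-B", "machine-C"]
print(execute_serial(machines, scene))
print(execute_parallel(machines, scene))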
In this embodiment, the registry module monitors whether the heartbeat of each drilling object is normal and, when a heartbeat abnormality exists, removes the drilling objects with abnormal heartbeats so as to determine the target drilling object; a fault drilling is then created in the experiment management module according to the target drilling object and the drilling scene. By monitoring the heartbeat of the drilling objects through the registry module before the fault drilling is executed, the accuracy of the fault drilling is guaranteed.
Based on the first embodiment, a third embodiment of the chaos experiment method is provided. Referring to fig. 8, in this embodiment, before step S10, the method may further include the following steps:
and M10, acquiring historical real fault data of a basic resource layer, a platform resource layer, an application resource layer and other middleware in the availability management system of the chaotic engineering.
The historical real fault data is obtained by extraction and evolution in 105 real historical production events produced in 3 years of real history of a base resource layer, a platform resource layer, an application resource layer and other middleware. For example, kubelet fault, kill pod, pod cpu full, pod memory full, node down, node Drain, and so on historical real fault data.
Historical real fault data of a plurality of different fault scenes generated by a basic resource layer, a platform resource layer, an application resource layer and other middleware in an availability management system of the chaotic engineering are collected.
Step M20, sorting the historical real fault data according to a preset rule, and extracting target real event data from the sorted historical real fault data.
The preset rule is the rule used to order the historical real fault data. For example, it may be the importance of the historical production events within each layer, the degree of impact on the system, or the frequency with which the historical production events occur.
After the historical real fault data are collected, they are sorted according to the preset rule, and the target real event data are extracted from the sorted data.
As one example, the historical real fault data are sorted by the importance of the historical production events within each layer, the data of higher importance are extracted from the sorted data, and these are determined as the target real event data.
As another example, the historical real fault data are sorted by their degree of impact on the system, the data with a larger impact are extracted from the sorted data, and these are determined as the target real event data.
As a further example, the historical real fault data are sorted by occurrence frequency, the data with a higher occurrence frequency are extracted from the sorted data, and these are determined as the target real event data.
Step M30, extracting fault atoms from the target real event data, and constructing a corresponding fault atom library based on those fault atoms, so that corresponding drilling scenes can be customized from the fault atom library in subsequent chaos experiments.
A fault atom is the fault type corresponding to a fault scene required for executing a chaos experiment, and it supports setting corresponding performance indicators as needed.
As an example, fault A may be set so that the load of the whole database CPU cannot exceed 60%, fault B may be set so that memory usage does not exceed 70%, and fault C may be set so that the response time is less than 50 ms.
The fault atom library is a fault injection expert library built from the fault atoms extracted from the historical real fault data. Before a chaos experiment is executed, one or more fault atoms can be customized from the fault atom library to create a corresponding fault scene; high-pressure reads and writes are started before the fault begins, and the fault scene is then injected to verify the reliability of the system and its ability to recover quickly and keep providing service normally.
After the target real event data are determined, the corresponding fault atoms can be extracted from them, and a corresponding fault atom library can then be constructed from one or more fault atoms, so that corresponding drilling scenes can be customized from the library in subsequent chaos experiments. (A minimal sketch of this construction follows.)
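A minimal Python sketch of this construction, assuming incidents are ranked by occurrence frequency and reusing the example thresholds above (the data structures are hypothetical):

from collections import Counter

def extract_target_events(incidents, top_n=20):
    # Rank historical incidents by how often their fault type occurred, keep the top N.
    freq = Counter(e["fault_type"] for e in incidents)
    ranked = sorted(incidents, key=lambda e: freq[e["fault_type"]], reverse=True)
    return ranked[:top_n]

def build_fault_atom_library(target_events):
    # One fault atom per distinct fault type, with example steady-state limits
    # taken from the thresholds mentioned above (CPU 60%, memory 70%, 50 ms).
    library = {}
    for e in target_events:
        library.setdefault(e["fault_type"], {
            "layer": e["layer"],
            "limits": {"cpu_pct": 60, "mem_pct": 70, "resp_ms": 50},
        })
    return library

incidents = [
    {"fault_type": "kubelet-down", "layer": "platform"},
    {"fault_type": "node-drain",   "layer": "platform"},
    {"fault_type": "kubelet-down", "layer": "platform"},
]
print(build_fault_atom_library(extract_target_events(incidents)))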
As an example, a drilling scene of the platform resource layer is customized from the fault atom library, for instance:
by simulating a halt of the kubelet service, verifying whether the pods on the node can be redeployed to other nodes after the kubelet service stops, whether the upstream business pods can still work normally after migrating to other nodes, and whether business continuity is affected.
As another example, a corresponding drilling scene is customized from the fault atom library, for instance:
setting the load of the pod CPU through a parameter and simulating the pod CPU running under the specified load, to verify whether the business system is affected.
As another example, a corresponding drilling scene is customized from the fault atom library, for example a vSAN cache disk failure, to verify whether data reads and writes in the business system are affected by the cache disk failure:
vSAN storage is presented through the concept of disk groups; a vSAN node consists of one or more disk groups, and a disk group consists of one cache disk and at least one capacity disk. The cache disk acts as read cache and write cache, accelerating data read and write IO, but because the cache disk and the capacity disks are logically bound together to provide the storage service, the cache disk failure is simulated through a Python script command carried by the host, which puts the capacity disks attached below it into an offline state. Because vSAN defaults to two copies and data are scattered randomly across the local storage of each vSAN cluster node, IO in progress at that moment is switched to other disk groups on the node or to disk groups of other healthy nodes, and data reads and writes are not affected.
In this embodiment, historical real fault data of the base resource layer, the platform resource layer, the application resource layer and other middleware in the availability management system of chaos engineering are collected; the historical real fault data are sorted according to a preset rule, and target real event data are extracted from the sorted data; fault atoms are extracted from the target real event data, and a corresponding fault atom library is built from those fault atoms. Diversified fault scenes can therefore be customized from the fault atom library in later chaos experiments, and the system can be broken on demand to carry out chaos experiments.
Based on the first embodiment, a fourth embodiment of the chaos experiment method is provided. Referring to fig. 9, in the present embodiment, step S30 may include substeps S31 to S34:
and a substep S31 of creating a fault drilling and generating an experiment list in the experiment management module according to the drilling object and the drilling scene, and sending an operation request to an API layer.
The relevant information of the fault drilling listed in the experiment list can include some basic information such as drilling state, whether the drilling is started, when the drilling is performed, the performer of the drilling, result information and the like, and some basic operations are also provided in the later experiment list; for example, some details of the fault drill are modified.
And according to the drilling object and the drilling scene, creating a corresponding fault drilling in the experiment management module, and generating an experiment list according to the relevant information of the fault drilling.
Referring to fig. 10, fig. 10 is a schematic diagram of the distributed architecture of the orchestration and scheduling functions.
The distributed architecture of the orchestration and scheduling functions can comprise an API layer, a key value storage, an orchestration instance scheduler, a task instance scheduler, a role monitor, a task converter and other modules, wherein:
the API layer is used for receiving requests from the experiment management layer and writing the experiments to be executed into the key value storage; it is also responsible for receiving the registration information of plug-in modules;
the key value storage is a key-value store used for storing information related to experiment operation, such as the orchestration model, the task model, the plug-in module registration information, and the role-to-node mapping information of the user's distributed system;
the orchestration instance scheduler and the task instance scheduler poll the key value storage, parse the corresponding model instances and operate on them according to their states; if the task instance scheduler finds that a task needs to be started or stopped, it calls the task converter, which performs role resolution and instruction conversion and issues the operation request to the corresponding target node;
the role monitor is used for monitoring the mapping between roles and nodes in the user's distributed system, updating it to the key value storage in real time, and handling role migration when the task instance scheduler polls. (A minimal sketch of this flow follows.)
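A minimal Python sketch of this flow, with an in-memory dictionary standing in for the key value storage and a single polling round standing in for the long-running scheduler (all names are illustrative, not the invention's API):

import json, time

kv_store = {}   # stands in for the key value storage (e.g. an etcd-like store)

def api_start_experiment(experiment_id, orchestration):
    # API layer: write the orchestration model instance for the schedulers to pick up.
    kv_store[f"/experiments/{experiment_id}"] = json.dumps(
        {"execute_status": "created", **orchestration})

def orchestration_scheduler(poll_interval_s=1.0, rounds=1):
    # Poll the store and advance newly created orchestration instances.
    for _ in range(rounds):
        for key, raw in list(kv_store.items()):
            model = json.loads(raw)
            if model["execute_status"] == "created":
                model["execute_status"] = "scheduled"   # hand off to the task scheduler
                kv_store[key] = json.dumps(model)
        time.sleep(poll_interval_s)

api_start_experiment("exp-001", {"objects": ["machine-A"], "mode": "serial"})
orchestration_scheduler(poll_interval_s=0, rounds=1)
print(kv_store)   # execute_status of exp-001 is now "scheduled"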
After the experiment list is generated, an operation request is sent to the API layer through the experiment management module; the operation request is the request that initiates the fault drilling.
Substep S32, after the API layer receives the operation request, starting the experiment and writing the fault drilling and the experiment list into the key value storage; the key value storage holds the information related to the execution of the fault drilling.
The API layer is responsible for receiving the operation request from the experiment management module, that is, the experiment is started through the API layer; after the operation request is parsed, the fault drilling and the experiment list are written into the key value storage, which holds the information related to the execution of the fault drilling.
Besides the operation requests of the experiment management layer, the API layer is also responsible for registering custom task modules developed as plug-ins: it receives the registration request from the plug-in module and, after authentication, writes the registration information into the registry held in the key value storage.
Substep S33, polling the related information in the key value storage through the orchestration instance scheduler, adding the drilling objects to the fault scene in a serial or parallel mode for orchestration and scheduling to obtain the scheduled drilling scene, and storing the scheduled drilling scene in the key value storage.
The orchestration instance scheduler is responsible for polling the orchestration model instances in the key value storage and, after parsing them, performing different operations according to their execution state (execute_status). The creation of an orchestration instance represents the start of an experiment, and the experiment continues until the instance is deleted; the life cycle of an orchestration instance starts with its creation and ends with its deletion after the task stages specified by each state machine have been executed in sequence, until the orchestration terminates or execution finishes.
After the experiment is started at the API layer, the orchestration instance scheduler polls the fault drilling and the experiment list in the key value storage, the drilling objects are added to the fault scene in a serial or parallel mode for orchestration and scheduling, the scheduled drilling scene is obtained, and the scheduled drilling scene can then be stored in the key value storage.
Substep S34, polling the timing information of the fault drilling in the key value storage through the task instance scheduler, and, when triggered through the API layer according to the timing information, executing the fault drilling in the scheduled fault scene.
A task instance is a running task; the orchestration instance scheduler creates or deletes tasks when state transitions occur, and the task instance scheduler only decides whether a task instance should start or stop executing. For example, if a task instance reaches its set duration and its concurrent instances have also timed out, the task instance is deleted, which ends its life cycle; if the concurrent instances have not finished, it waits for the last finished concurrent task instance to advance the flow and then be deleted.
The timing information is the execution time set for the fault drilling.
The task instances of the fault drilling in the key value storage are polled through the task instance scheduler to obtain the timing information of each task instance; when the time specified by the timing information arrives, the experiment can be started through the API layer and the fault drilling executed in the scheduled fault scene. (A minimal sketch of this timed triggering follows.)
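A minimal Python sketch of this timed triggering (the task records and the start_experiment callback are illustrative stand-ins for the key value storage entries and the API layer):

import time
from datetime import datetime

def task_instance_scheduler(tasks, start_experiment, poll_interval_s=5, rounds=1):
    # Poll the stored timing information and trigger the API layer when a drilling is due.
    for _ in range(rounds):
        now = datetime.now()
        for exp_id, task in tasks.items():
            if task["status"] == "waiting" and task["start_at"] <= now:
                start_experiment(exp_id)        # call into the API layer
                task["status"] = "running"
        time.sleep(poll_interval_s)

tasks = {"exp-001": {"status": "waiting", "start_at": datetime.now()}}
task_instance_scheduler(tasks, start_experiment=lambda exp: print("start", exp),
                        poll_interval_s=0, rounds=1)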
In addition, the usability treatment system of the chaotic engineering also supports a multi-concurrency experiment function, and the multi-concurrency experiment capability is realized through the following modes, which specifically comprise:
(1) Rate limiting: rate limiting can be implemented by restricting the number of connections waiting on the server and their waiting time, so that only a small number of users are admitted at a time while the remaining users are held in a virtual queue (a minimal sketch is given after this list).
(2) Page staticization: pages are rendered statically with Freemarker, which reduces interaction with the back-end server and thereby reduces server pressure.
(3) Introduction of Redis: rate limiting and staticization both relieve back-end pressure, but requests still reach the server, and the interaction between the back-end code and the database slows down the response; the platform therefore uses Redis for high-speed data reads.
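A minimal Go sketch of the rate-limiting idea in item (1): the server admits only a bounded number of concurrent requests, and additional callers wait in a queue for at most a configurable time before being turned away. The function name and parameters are illustrative; the patent does not prescribe a concrete implementation.

package chaos

import (
	"net/http"
	"time"
)

// Limit wraps a handler with maxConcurrent admission slots and a maxWait queuing
// timeout, approximating the "connection waiting number and waiting time" controls.
func Limit(next http.Handler, maxConcurrent int, maxWait time.Duration) http.Handler {
	slots := make(chan struct{}, maxConcurrent)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case slots <- struct{}{}: // admitted
			defer func() { <-slots }()
			next.ServeHTTP(w, r)
		case <-time.After(maxWait): // still queued after maxWait
			http.Error(w, "server busy, please retry later", http.StatusTooManyRequests)
		}
	})
}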
When the task instance manager finds multiple concurrent tasks after polling the key-value store, it can execute multiple task instances simultaneously during fault drill execution.
In addition, the availability governance system of chaos engineering allows the user to view the node information of the drill process in real time during fault drill execution, and to monitor the running state of each interface of the service while the fault drill is executed.
Further, in this embodiment, after step S34, the method may further include: controlling the blast radius of the fault drill based on preset steady-state indices and the thresholds corresponding to those indices.
Here, a steady-state index is a system index parameter measured while the system is running normally, and its threshold is the value that the index must not exceed during the fault drill; for example, the experiment duration, CPU utilization and memory utilization can all be configured in this way.
The steady-state indices are monitored in real time while the fault drill is executed, and the fault drill is stopped automatically as soon as a steady-state index exceeds its threshold; the fault drill is thereby kept safe and controllable.
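A minimal Go sketch of this blast-radius guard, assuming a metric provider exposed by the monitoring layer and a stop callback exposed by the platform; the type and field names are illustrative, not taken from the patent.

package chaos

import (
	"context"
	"time"
)

// SteadyStateGuard watches steady-state indices against their thresholds.
type SteadyStateGuard struct {
	Thresholds map[string]float64                 // e.g. "cpu_utilization": 0.85
	ReadMetric func(name string) (float64, error) // supplied by the monitoring layer
	StopDrill  func(reason string)                // one-key stop of the running drill
}

// Watch polls each steady-state index and stops the fault drill as soon as any
// index exceeds its configured threshold.
func (g *SteadyStateGuard) Watch(ctx context.Context, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for name, limit := range g.Thresholds {
				value, err := g.ReadMetric(name)
				if err != nil {
					continue // metric temporarily unavailable; keep watching
				}
				if value > limit {
					g.StopDrill(name + " exceeded its steady-state threshold")
					return
				}
			}
		}
	}
}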
In this embodiment, the fault drill and the experiment list are created in the experiment management module according to the drill object and the drill scene, and an operation request is sent to the API layer; the API layer starts the experiment once the operation request is triggered and writes the fault drill and the experiment list into the key-value store; the orchestration instance scheduler polls the related information in the key-value store, adds the drill objects in the fault scene in a serial or parallel manner for orchestration and scheduling, obtains the orchestrated drill scene and stores it in the key-value store; the task instance scheduler polls the timing information of the fault drill in the key-value store, starts the experiment through the API layer according to the timing information, and executes the fault drill in the orchestrated fault scene. Enterprise-level orchestration and scheduling of fault drills is thereby realized.
Based on the first embodiment, a fifth embodiment of the chaotic experimental method is provided. In this embodiment, after the step S30, the method may further include:
Step D10: when a sudden fault occurs during the execution of the fault drill, the fault drill is stopped.
Here, the abnormality refers to an emergency arising in the original fault process, such as a node going down or rebooting, a Pod being killed, or Pod network delay.
Referring to fig. 11, fig. 11 is a schematic flow chart of executing a fault drill. During fault drill execution, the user can view and monitor the monitoring data of the service in real time; when an emergency or unexpected abnormality occurs during execution, the availability governance system of chaos engineering also supports one-key stop of the fault drill, so that the drill process can be prevented and controlled.
In this embodiment, when an emergency occurs during the execution of the fault drill, the fault drill is stopped with one key, so that the drill process can be prevented and controlled.
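One way to realize the one-key stop is sketched below in Go, under the same etcd assumption as the earlier sketches: since an experiment runs until its orchestration instance is deleted, stopping a drill amounts to removing its instance keys from the key-value store. The key layout and handler name are illustrative assumptions.

package chaos

import (
	"context"
	"net/http"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// StopDrillHandler deletes the orchestration and task instances of a running
// fault drill, which ends the experiment immediately.
func StopDrillHandler(kv *clientv3.Client) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		drillID := r.URL.Query().Get("drill_id")
		if drillID == "" {
			http.Error(w, "missing drill_id", http.StatusBadRequest)
			return
		}
		ctx, cancel := context.WithTimeout(r.Context(), 3*time.Second)
		defer cancel()
		if _, err := kv.Delete(ctx, "/chaos/instances/"+drillID, clientv3.WithPrefix()); err != nil {
			http.Error(w, "failed to stop drill", http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}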
Based on the first embodiment, a sixth embodiment of the chaotic experimental method is provided. In this embodiment, the method may further include:
Step P10: when a modification instruction based on a bottom-layer tool extension is received, the LitmusChaos code is modified according to the modification instruction, so that the drill scenes in the availability governance system of chaos engineering are adapted to the enterprise-level OpenShift container platform and the system supports OCP cluster fault scenes.
At present, the community version of LitmusChaos has a defect that prevents the fault scene targeting pod IO from working normally. During the early analysis of the defect, the workers (cooperating developers / framework-center personnel) carried out sufficient verification and testing on both OpenShift and native Kubernetes, and the scene could not work normally on either. Several rounds of technical analysis showed that the defect was caused by an incompatibility between the very low-level Docker mechanism and cgroup.
cgroup is a Linux kernel feature for limiting, controlling and isolating the resources of a process group, and the bug involves the pumba component used by LitmusChaos. To address this, the workers (cooperating developers / framework-center personnel) eventually fixed the bug by modifying the code of LitmusChaos and pumba.
When the availability governance system of chaos engineering receives a modification instruction for a bottom-layer tool extension, the LitmusChaos code can be modified according to the instruction, so that the drill scenes in the system are adapted to the enterprise-level OpenShift container platform and OCP cluster fault scenes are supported.
Before the modification instruction is received, because LitmusChaos was developed for Kubernetes, some chaos engineering scenes in the availability governance system of chaos engineering can only run on community Kubernetes. By modifying the LitmusChaos code to adapt it to OpenShift, the system can fully support OCP clusters, and the optimized fault injection component has characteristics such as high cohesion, low coupling and high portability.
As an example, referring to fig. 12, fig. 12 is a schematic flow chart of executing a chaos experiment in the availability governance system of chaos engineering after the LitmusChaos code has been modified. After the modification is completed, the application Pod for an experiment can be specified in the application program, a corresponding fault scene can be customized in the chaos experiment store to create the corresponding chaos experiment, and the specified application Pod and the chaos experiment are imported into the chaos test engine; the Litmus Operator then watches the chaos test engine and the tasks of the chaos experiment are executed, producing the corresponding chaos experiment result.
In this embodiment, code repair based on a bottom-layer tool extension is performed on the availability governance system of chaos engineering, so that the drill scenes in the system are adapted to the enterprise-level OpenShift container platform and OCP cluster fault scenes are supported.
Based on the first embodiment, a seventh embodiment of the chaotic experimental method of the present invention is provided. In this embodiment, after the step S40, the method may further include:
Step E10: when the fault drill has been executed successfully, the fault drill result is obtained, an early warning is issued according to the fault drill result, and the fault drill result is stored in the log system.
Referring to fig. 4, fig. 4 is a schematic diagram of the application architecture of the availability governance system of chaos engineering. The system further includes a log system, whose data storage can be divided into a MySQL database, an ElasticSearch database, a Redis database and the like, wherein:
the MySQL database is used for storing the drill task data of the fault drill, that is, the related data generated by the fault drill;
the ElasticSearch database is used for storing logs, fault drill requests and the request data corresponding to those requests;
the Redis database is used for storing the authentication duration and validity of the identity token with which a user logs in to the availability governance system of chaos engineering. Referring to fig. 13, fig. 13 is a schematic diagram of the effect of caching data in the Redis database (a minimal caching sketch follows).
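A minimal Go sketch of caching a login token in Redis with an authentication time-to-live, as the Redis database is described above; the go-redis client, key layout and function names are illustrative assumptions, not taken from the patent.

package chaos

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// CacheToken stores a user's login token with an expiry equal to the
// authentication duration; an expired key means the token is no longer valid.
func CacheToken(ctx context.Context, rdb *redis.Client, userID, token string, ttl time.Duration) error {
	return rdb.Set(ctx, "chaos:token:"+userID, token, ttl).Err()
}

// TokenIsValid checks whether the presented token matches the cached one.
func TokenIsValid(ctx context.Context, rdb *redis.Client, userID, token string) (bool, error) {
	stored, err := rdb.Get(ctx, "chaos:token:"+userID).Result()
	if err == redis.Nil {
		return false, nil // no token cached or it has expired
	}
	if err != nil {
		return false, err
	}
	return stored == token, nil
}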
When the fault drill has been executed successfully, the fault drill result needs to be obtained and an early warning issued according to it, so that the workers (cooperating developers / framework-center personnel) can verify the validity and authenticity of the fault drill after the fault occurs.
Meanwhile, the fault drill result is stored in the log system to help the user query the experiment status of the fault drill.
In this embodiment, when the fault drill has been executed successfully, the fault drill result is obtained, an early warning is issued according to the result, and the result is stored in the log system; the validity and authenticity of the fault drill result can thus be verified after the fault drill is finished.
The present invention also provides an availability governance system of chaos engineering, which is used for executing the chaotic experimental method described above and includes an object management module, a fault atom library, an experiment management module and a defect tracking module, wherein:
the object management module is used for managing the drill object set required by the chaos experiment, the drill object set including base objects and cluster objects, and is further used for selecting a drill object from the drill object set when a drill command of a user is received;
the fault atom library is used for storing the fault atoms required by the chaos experiment, the fault atoms covering at least one of a basic resource layer, a platform resource layer, an application resource layer and other middleware, and is further used for screening out the fault atom corresponding to the drill purpose of the drill command and creating a drill scene according to the fault atom;
the experiment management module is used for creating a fault drill according to the drill object and the drill scene, orchestrating and scheduling the drill scene to obtain an orchestrated drill scene, and executing the fault drill in the orchestrated drill scene; and
the defect tracking module is used for monitoring the whole drill process of the fault drill and tracking problems during the drill process.
The present invention also provides a chaos experiment apparatus. Referring to fig. 14, the chaos experiment apparatus of the present invention includes:
a selecting module 10, configured to, when a drill command of a user is received, parse out the drill object set under the user's name through the object management module and select a drill object from the drill object set, the drill object set including base objects and cluster objects;
a screening module 20, configured to screen out a fault atom from the fault atom library based on the drill purpose of the drill command and create a drill scene according to the fault atom;
a creating module 30, configured to create a fault drill in the experiment management module according to the drill object and the drill scene, orchestrate and schedule the drill scene to obtain an orchestrated drill scene, and execute the fault drill in the orchestrated drill scene; and
a monitoring module 40, configured to monitor the fault drill through the defect tracking module and track problems during the fault drill process.
In addition, the present invention also provides a computer readable storage medium having a chaotic experimental program stored thereon, the chaotic experimental program, when executed by a processor, implementing the steps of the chaotic experimental method as described above.
For the method implemented when the chaotic experimental program is executed on the processor, reference may be made to the embodiments of the chaotic experimental method of the present invention, which are not repeated here.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the description of the foregoing embodiments, it is clear to those skilled in the art that the method of the foregoing embodiments may be implemented by software plus a necessary general hardware platform, and certainly may also be implemented by hardware, but in many cases, the former is a better implementation. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A chaotic experimental method is characterized in that the chaotic experimental method is applied to an availability management system of chaotic engineering, the availability management system of chaotic engineering comprises an object management module, a fault atom library, an experiment management module and a defect tracking module, and the chaotic experimental method comprises the following steps:
when a drilling command of a user is received, analyzing a drilling object set under the user name through the object management module, and selecting a drilling object from the drilling object set; wherein the set of drill objects comprises a base object and a cluster object;
screening out fault atoms from the fault atom library based on the drilling purpose of the drilling command, and creating a drilling scene according to the fault atoms;
according to the drilling object and the drilling scene, fault drilling is established in the experiment management module, the drilling scene is scheduled to obtain a scheduled drilling scene, and the fault drilling is executed in the scheduled drilling scene;
and monitoring the whole drilling process of the fault drilling through the defect tracking module, and performing problem tracking in the drilling process of the fault drilling.
2. The chaotic experimental method of claim 1, wherein the availability management system of chaotic engineering further comprises a registration center module, and after the step of, when a drilling command of a user is received, analyzing the drilling object set under the user name through the object management module and selecting a drilling object from the drilling object set, the method further comprises:
monitoring, through the registration center module, whether the heartbeat of the drilling object is normal, and eliminating the drilling object whose heartbeat is abnormal so as to determine a target drilling object;
the step of creating a fault drill in the experiment management module according to the drill object and the drill scene comprises:
and creating fault drilling in the experiment management module according to the target drilling object and the drilling scene.
3. The chaotic experimental method as claimed in claim 1, wherein before the step of, when a drilling command of a user is received, analyzing the drilling object set under the user name through the object management module and selecting a drilling object from the drilling object set, the chaotic experimental method further comprises:
acquiring historical real fault data of a basic resource layer, a platform resource layer, an application resource layer and other middleware in the availability management system of the chaotic engineering;
sorting the historical real fault data according to a preset rule, and extracting target real event data from the sorted historical real fault data;
and extracting fault atoms according to the target real event data, and constructing a corresponding fault atom library based on the fault atoms so as to customize a corresponding drilling scene from the fault atom library in the subsequent chaotic experiment.
4. The chaotic experimental method as claimed in claim 1, wherein the step of creating a fault drill in the experiment management module according to the drill object and the drill scene, performing scheduling on the drill scene to obtain a scheduled drill scene, and executing the fault drill in the scheduled drill scene comprises:
according to the drilling object and the drilling scene, establishing fault drilling and generating an experiment list through the experiment management module, and sending an operation request to an API layer;
after the API layer receives the operation request, starting an experiment, and writing the fault drilling and the experiment list into a key value storage; the key value storage stores relevant information in the fault drilling execution process;
polling related information in the key value storage through a scheduling instance scheduler, adding the drilling objects in the fault scene in a serial or parallel mode for scheduling to obtain a scheduled drilling scene, and storing the scheduled drilling scene into the key value storage;
polling the timing information of the fault drilling in the key value storage through a task instance scheduler, starting an experiment through the API layer according to the timing information, and executing the fault drilling in the scheduled fault scene.
5. The chaotic experimental method of claim 1, wherein after the step of polling the timing information of the fault drilling in the key value storage through the task instance scheduler, starting an experiment through the API layer according to the timing information, and executing the fault drilling in the scheduled fault scene, the method further comprises:
and controlling the blast radius of the fault drilling based on a preset steady-state index and a threshold value corresponding to the steady-state index.
6. The chaotic experimental method of claim 1, wherein after the step of creating a fault drill in the experiment management module according to the drill object and the drill scene, performing scheduling on the drill scene to obtain a scheduled drill scene, and executing the fault drill in the scheduled drill scene, the method further comprises:
and when sudden faults occur in the execution process of the fault drilling, the fault drilling is stopped.
7. The chaotic experimental method according to claim 1, wherein the availability management system of chaotic engineering further comprises a platform display module, and after the step of monitoring the execution state of the fault drill by the defect tracking module and performing problem tracking in the execution process of the fault drill, the method further comprises:
displaying the experimental result data of the fault drilling through the platform display module; the experimental result data comprises experiment completion indexes, experiment quantity distribution and ranking and problem analysis.
8. The chaotic experimental method of claim 1, further comprising:
when a modification instruction based on a bottom-layer tool extension is received, modifying the LitmusChaos code based on the modification instruction, so that a drilling scene in the availability management system of the chaotic engineering is adapted to an enterprise-level OpenShift container platform and the availability management system of the chaotic engineering supports an OCP cluster fault scene.
9. The chaotic experimental method according to claim 1, wherein the availability management system of chaotic engineering further comprises a log system, and after the step of creating a fault drill in the experiment management module according to the drill object and the drill scene, performing orchestration and scheduling on the drill scene to obtain an orchestrated drill scene, and executing the fault drill in the orchestrated drill scene, the method further comprises:
and under the condition that the fault drilling is successfully executed, obtaining a fault drilling result, carrying out early warning according to the fault drilling result, and storing the fault drilling result to a log system.
10. An availability governance system of chaotic engineering, for performing the chaotic experimental method of claim 1, comprising an object management module, a fault atom library, an experiment management module, and a defect tracking module, wherein,
the object management module is used for managing a drilling object set required by the chaotic experiment, wherein the set of drilling objects comprises a base object and a cluster object, and is further used for selecting a drilling object from the drilling object set when a drilling command of a user is received;
the fault atom library is used for storing fault atoms required by the chaos experiment, the fault atoms comprising at least one of a basic resource layer, a platform resource layer, an application resource layer and other middleware, and is further used for screening out a fault atom corresponding to the drilling purpose based on the drilling purpose of the drilling command and creating a drilling scene according to the fault atom;
the experiment management module is used for creating fault drilling according to the drilling object and the drilling scene, arranging and scheduling the drilling scene to obtain an arranged drilling scene, and executing the fault drilling in the arranged drilling scene;
and the defect tracking module is used for monitoring the whole drilling process of the fault drilling and carrying out problem tracking in the drilling process of the fault drilling.
CN202211504121.0A 2022-11-28 2022-11-28 Usability treatment system based on chaos engineering Pending CN115759518A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211504121.0A CN115759518A (en) 2022-11-28 2022-11-28 Usability treatment system based on chaos engineering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211504121.0A CN115759518A (en) 2022-11-28 2022-11-28 Usability treatment system based on chaos engineering

Publications (1)

Publication Number Publication Date
CN115759518A true CN115759518A (en) 2023-03-07

Family

ID=85339548

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211504121.0A Pending CN115759518A (en) 2022-11-28 2022-11-28 Usability treatment system based on chaos engineering

Country Status (1)

Country Link
CN (1) CN115759518A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118132390A (en) * 2024-05-10 2024-06-04 中移(苏州)软件技术有限公司 Fault drilling method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination