CN113342650A - Chaos engineering method and device for distributed system - Google Patents

Chaos engineering method and device for distributed system

Info

Publication number
CN113342650A
Authority
CN
China
Prior art keywords
fault
test
data
test case
abnormal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110603040.5A
Other languages
Chinese (zh)
Other versions
CN113342650B (en)
Inventor
张晓娜
暨光耀
傅媛媛
黄琼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC filed Critical Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202110603040.5A priority Critical patent/CN113342650B/en
Publication of CN113342650A publication Critical patent/CN113342650A/en
Application granted granted Critical
Publication of CN113342650B publication Critical patent/CN113342650B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/36Preventing errors by testing or debugging software
    • G06F11/3668Software testing
    • G06F11/3672Test management
    • G06F11/3684Test management for test design, e.g. generating new test cases
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/36Preventing errors by testing or debugging software
    • G06F11/3668Software testing
    • G06F11/3672Test management
    • G06F11/3688Test management for test execution, e.g. scheduling of test suites

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a chaos engineering method and device for a distributed system, relating to the technical field of distributed systems and chaos engineering. The method comprises the following steps: acquiring test data of the distributed system and server equipment data through code embedding points; replacing the test data with abnormal data to form abnormal data test cases; generating fault types related to corresponding fault points according to the server equipment data and a pre-established fault expert library to form fault test cases, the fault expert library being the relation among the server equipment type, the server equipment fault type and the fault occurrence probability; and executing the abnormal data test cases and the fault test cases to obtain the test result of the distributed system. The invention can comprehensively and efficiently improve the robustness and high availability of the distributed system.

Description

Chaos engineering method and device for distributed system
Technical Field
The invention relates to the technical field of distributed systems and chaotic engineering, in particular to a chaotic engineering method and device for a distributed system.
Background
This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art merely by virtue of its inclusion in this section.
In recent years, as system architectures have evolved from monolithic applications to distributed systems, development efficiency and system scalability have gradually improved, but the complexity of the system has also increased, and traditional service testing methods can no longer comprehensively cover all possible behaviors of the system. With the continuous development of microservices and the continuous growth of system scale, dependencies between services bring many uncertainties; in such a complex call network, an exception in any link may affect other services. Moreover, as the number of service nodes increases, the probability and randomness of faults also increase, so how to improve the robustness and high availability of the distributed system has become a problem to be solved urgently.
At present, the robustness and high availability of most distributed systems are pursued through chaos engineering, which verifies the robustness of the system mainly by simulating faults.
Conventional chaos engineering verifies system behavior only under unexpected faults or only under abnormal parameters; the two are not verified in combination, so test coverage is incomplete and the robustness and high availability of the system cannot be effectively improved. In addition, most current chaos engineering methods cannot be fully automated and require test cases to be designed and executed manually, so test efficiency is low and labor consumption is high.
Disclosure of Invention
The embodiment of the invention provides a chaotic engineering method of a distributed system, which is used for comprehensively and efficiently improving the robustness and the high availability of the distributed system, and comprises the following steps:
acquiring test data of a distributed system and server equipment data through code embedding points;
replacing the test data with abnormal data to form an abnormal data test case;
generating a fault type related to a corresponding fault point according to server equipment data and a pre-established fault expert library to form a fault test case; the fault expert database is the relation among the server equipment type, the server equipment fault type and the fault occurrence probability;
and executing the abnormal data test case and the fault test case to obtain the test result of the distributed system.
The embodiment of the invention also provides a chaos engineering device of the distributed system, which is used for comprehensively and efficiently improving the robustness and the high availability of the distributed system, and comprises:
the acquisition unit is used for acquiring test data of the distributed system and server equipment data through code embedded points;
the abnormal data test case generating unit is used for replacing the test data with abnormal data to form an abnormal data test case;
the fault test case generation unit is used for generating fault types related to corresponding fault points according to the server equipment data and a pre-established fault expert database to form fault test cases; the fault expert database is the relation among the server equipment type, the server equipment fault type and the fault occurrence probability;
and the test unit is used for executing the abnormal data test case and the fault test case to obtain a test result of the distributed system.
The embodiment of the invention also provides computer equipment which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein the processor realizes the chaos engineering method of the distributed system when executing the computer program.
An embodiment of the present invention further provides a computer-readable storage medium, where a computer program for executing the chaotic engineering method of the distributed system is stored in the computer-readable storage medium.
In the embodiment of the invention, compared with existing chaos engineering schemes for distributed systems, in which test coverage is incomplete, the robustness and high availability of the system cannot be effectively improved, and test cases must be designed manually and then executed so that efficiency is low, the chaos engineering scheme of the distributed system: acquires test data of the distributed system and server equipment data through code embedding points; replaces the test data with abnormal data to form abnormal data test cases; generates fault types related to corresponding fault points according to the server equipment data and a pre-established fault expert library to form fault test cases, the fault expert library being the relation among the server equipment type, the server equipment fault type and the fault occurrence probability; and executes the abnormal data test cases and the fault test cases to obtain the test result of the distributed system, thereby comprehensively and efficiently improving the robustness and high availability of the distributed system.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts. In the drawings:
FIG. 1 is a schematic flow chart of a chaotic engineering method for a distributed system according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a chaos engineering method for a distributed system according to another embodiment of the present invention;
FIG. 3 is a schematic diagram of transaction links and involved servers in a service invocation relationship in accordance with an embodiment of the present invention;
FIG. 4 is a flowchart illustrating an exemplary process of executing an abnormal data test case according to an embodiment of the present invention;
FIG. 5 is a schematic flow chart illustrating the execution of a failure test case according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a chaos engineering apparatus of a distributed system in an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the embodiments of the present invention are further described in detail below with reference to the accompanying drawings. The exemplary embodiments and descriptions of the present invention are provided to explain the present invention, but not to limit the present invention.
Before describing the embodiments of the present invention, terms related to the embodiments of the present invention will be described.
1. Chaos engineering: chaos engineering is the discipline of experimenting on a distributed system, aimed at building confidence in the system's ability to withstand turbulent conditions in a production environment. It can be regarded as a set of experiments for revealing system weaknesses: the harder it is to break the steady state, the higher the confidence in the system's behavior.
2. JMeter is a Java-based testing tool developed by the Apache organization that can be used for stress testing, functional testing and regression testing of software.
3. Postman is a powerful tool for testing HTTP interfaces, developed by Postdot Technologies, Inc.
4. ChaosBlade is a fault-injection tool open-sourced by Alibaba in 2018; it can be used to inject CPU, memory, network, disk and other faults, and also supports secondary development and optimization as needed.
5. LoadRunner is software developed by Hewlett-Packard, mainly used for automated load testing of various system architectures; it can predict system behavior and evaluate system performance.
In order to improve the robustness and high availability of a distributed system more comprehensively and efficiently, the embodiment of the invention provides a chaos engineering scheme for the distributed system. First, code embedding points are added to collect related data and links; the collected test data is then replaced with abnormal data, through data replacement and similar means, to form abnormal data test cases. In addition, fault test cases are generated for the collected server equipment data in combination with a fault expert database. The generated test cases are then executed automatically, the related monitoring data is collected, and the test results are sent to the testers. With this chaos engineering test method, more comprehensive chaos engineering test cases (abnormal data test cases and fault test cases) can be generated and executed automatically without additional manpower input, which reduces labor cost, improves test coverage, and further improves the robustness and high availability of the distributed system without increasing manpower. The chaos engineering scheme of the distributed system is described in detail below.
Fig. 1 is a schematic flow chart of a chaos engineering method of a distributed system in an embodiment of the present invention, and as shown in fig. 1, the method includes the following steps:
step 101: acquiring test data of a distributed system and server equipment data through code embedding points;
step 102: replacing the test data with abnormal data to form an abnormal data test case;
step 103: generating a fault type related to a corresponding fault point according to server equipment data and a pre-established fault expert library to form a fault test case; the fault expert database is the relation among the server equipment type, the server equipment fault type and the fault occurrence probability;
step 104: and executing the abnormal data test case and the fault test case to obtain the test result of the distributed system.
The chaos engineering method of the distributed system provided by the embodiment of the invention can realize that the robustness and the high availability of the distributed system are comprehensively and efficiently improved. The steps involved in the method are described in detail below with reference to fig. 2 to 5.
First, step 101, i.e., step 1 in fig. 2, is described.
In specific implementation, code embedding points are placed in the code; related data, including transaction links, service names, parameters, container IDs, database IPs and the like, are collected through the embedding points, and the collected data is subjected to primary processing and stored in a database.
In specific implementation, the embedding point (buried point) is a term from the field of data collection, also known as event tracking, and refers to the techniques and process of capturing, processing and transmitting specific user behaviors or events.
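To make the embedding point concrete, the following is a minimal JAVA sketch of such a collection point; it is illustrative only and is not the patented implementation. The class and method names (EmbeddingPoint, track, store) are hypothetical, and persistence is reduced to a print statement.

    import java.util.LinkedHashMap;
    import java.util.Map;

    public final class EmbeddingPoint {

        // Captures one service-to-service call observed at the embedding point.
        public static void track(String traceId, String sourceService, String calledService,
                                 Map<String, Object> parameters, String containerId, String databaseIp) {
            Map<String, Object> record = new LinkedHashMap<>();
            record.put("traceId", traceId);          // identifies the transaction link
            record.put("source", sourceService);
            record.put("called", calledService);
            record.put("parameters", parameters);    // raw message string or parsed fields
            record.put("containerId", containerId);
            record.put("databaseIp", databaseIp);
            store(record);                           // hand over for primary processing and storage
        }

        private static void store(Map<String, Object> record) {
            // In a real system this would write to the test-data database;
            // printing stands in for persistence in this sketch.
            System.out.println(record);
        }
    }

In a real system the track() call would sit at the service invocation boundary, for example in an interceptor or filter, so that every cross-service call in the transaction link is recorded.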
In one embodiment, the chaos engineering method for a distributed system may further include: and carrying out primary processing on the test data to obtain the test data after the primary processing.
In an embodiment, the preliminary processing the test data to obtain the preliminary processed test data may include: and splitting the test data message string to obtain the test data of the message string with the preset format.
In specific implementation, this processing mainly refers to processing the collected service parameters. Most service parameters arrive in the form of a message string, and splitting them makes it possible to replace test data in subsequent steps, which further improves the efficiency of testing the distributed system. Specifically, the parameter message string is split into the following format: parameter name | parameter type | parameter value.
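As an illustration of this splitting, the sketch below parses a collected parameter message string into name | type | value triples. The ';' and ',' separators and the Field record are assumptions of this sketch; the actual message layout is system specific.

    import java.util.ArrayList;
    import java.util.List;

    public final class MessageSplitter {

        // One parsed parameter in the "parameter name | parameter type | parameter value" layout.
        public record Field(String name, String type, String value) {}

        // Splits a collected parameter message string; ';' between parameters and
        // ',' between name/type/value are assumed separators for this sketch.
        public static List<Field> split(String message) {
            List<Field> fields = new ArrayList<>();
            for (String part : message.split(";")) {
                String[] tokens = part.split(",", 3);
                if (tokens.length == 3) {
                    fields.add(new Field(tokens[0].trim(), tokens[1].trim(), tokens[2].trim()));
                }
            }
            return fields;
        }

        public static void main(String[] args) {
            // e.g. the parameters passed from service A to service B in fig. 3
            split("b1,int(3),10;b2,str(12),BBBBB;b3,BigDecimal,1234567890123456")
                    .forEach(f -> System.out.println(f.name() + " | " + f.type() + " | " + f.value()));
        }
    }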
To facilitate an understanding of how the present invention may be implemented, examples of collected test data and server device data are described below in conjunction with FIG. 3.
Fig. 3 shows the transaction link and the servers involved in a service invocation relationship provided by the embodiment of the present invention. As shown in fig. 3: service A calls service B, service B calls service C, and service C calls service D. When service A calls service B, three parameters are transmitted, namely (b1, int(3)), (b2, str(12)) and (b3, BigDecimal); the parameter value transmitted to b1 is 10, the parameter value transmitted to b2 is BBBBB, and the parameter value transmitted to b3 is 1234567890123456. The transaction involves four application containers, as well as database IP1, database IP2, database IP3 and the NOS cache server IP1. After the relevant transaction links and servers are collected through the code embedding points of step 101, they are processed and stored in the database; the types stored in the database are as follows.
1. The test data mainly refers to the type of parameters and parameter values that are transferred when called between different services. The test data is stored in the database in the following format:
(unique index ID, source service name, called service name 1, parameter name 1, parameter type 1, parameter name 2, parameter type 2, parameter name 3, parameter type 3, …, parameter name N, parameter type N);
(unique index ID, called service name 1, called service name 2, parameter name 1, parameter type 1, parameter name 2, parameter type 2, parameter name 3, parameter type 3, …, parameter name N, parameter type N);
(unique index ID, called service name 2, called service name 3, parameter name 1, parameter type 1, parameter name 2, parameter type 2, parameter name 3, parameter type 3, …, parameter name N, parameter type N).
Since the same service may be invoked by a plurality of different services, the different services involved in the same transaction are uniquely identified by the unique index ID. From the information stored in the database above, the link of the transaction can be derived: the "source service name" calls "called service name 1", "called service name 1" calls "called service name 2", and "called service name 2" calls "called service name 3", together with the parameters and parameter types of each layer of service call.
Preferably, the parameter types may include, but are not limited to, char, int, str, BigDecimal and Date, and are stored in the database according to the actual collection situation.
2. Server device related data (server equipment data) refers to the server devices involved in a transaction, including MYSQL, ORACLE, docker container, linux server, F5, SLB, DBLE, NOS cache server, Redis cache server, KAFKA, MQ and other device types related to the distributed system. The server equipment data is stored in the database in the following format:
(unique index ID, server type, server IP 1);
(unique index ID, server type, server IP 2).
Preferably, the server types here include, but are not limited to, MYSQL, ORACLE, docker container, linux server, F5, SLB, DBLE, NOS cache server, Redis cache server, KAFKA, MQ and other device types related to the distributed system.
Preferably, the different servers involved in the same transaction are uniquely represented by unique index IDs, since the same service may involve multiple different servers.
Next, a preferred step between the above steps 101 and 102, step 2 in fig. 2, is described.
In specific implementation, the collected test data and the data related to the server equipment are subjected to secondary processing (preprocessing), and are stored in a database according to a certain format (see the following mapping relation format).
In specific implementation, because the type or number of parameters of a service transaction may change with each version, the test data of a previous version is not necessarily applicable to the next version; the cleaning mechanism therefore cleans the test data and the server equipment data in the database once for each version.
In an embodiment, the chaos engineering method for a distributed system may further include: and preprocessing the test data and the server equipment data to obtain a mapping relation between the test data and the server equipment data.
In specific implementation, the secondary processing (preprocessing) refers to mapping the collected test data to the server equipment data. Because a whole transaction link involves many items of collected test data and many server device types, a one-to-one mapping is needed so that the correct test cases can be generated for the correct service types, which further improves test accuracy.
In specific implementation, the mapping relationship format may be as follows: service name | parameter name 1 | parameter type 1 | parameter value 1 | parameter name 2 | parameter type 2 | parameter value 2 | … | parameter name n | parameter type n | parameter value n | server type | server IP.
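A minimal sketch of building one such mapping line is given below, assuming the collected parameters and the server device information of the same transaction have already been joined by their unique index ID; the class and record names are hypothetical.

    import java.util.List;
    import java.util.StringJoiner;

    public final class MappingBuilder {

        // One collected parameter of the transaction; the names are illustrative.
        public record Param(String name, String type, String value) {}

        // Builds the "|"-separated mapping line described above from the test data
        // and the server device data of the same transaction.
        public static String buildMapping(String serviceName, List<Param> params,
                                          String serverType, String serverIp) {
            StringJoiner line = new StringJoiner("|");
            line.add(serviceName);
            for (Param p : params) {
                line.add(p.name()).add(p.type()).add(p.value());
            }
            return line.add(serverType).add(serverIp).toString();
        }

        public static void main(String[] args) {
            System.out.println(buildMapping("serviceB",
                    List.of(new Param("b1", "int(3)", "10"), new Param("b2", "str(12)", "BBBBB")),
                    "docker container", "122.18.xx.yy"));
        }
    }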
Third, the above step 102, i.e., step 3 in fig. 2, is introduced.
In one embodiment, replacing the test data with abnormal data to form an abnormal data test case may include: and replacing the test data into abnormal data by field type replacement, field length replacement and special field replacement to form an abnormal data test case.
In specific implementation, the collected test data is replaced: normal test data is replaced with abnormal data through three replacement modes, namely field type replacement, field length replacement and special field replacement, to form abnormal data test cases. To facilitate an understanding of how embodiments of the present invention may be implemented, these three replacement modes are described in detail below by way of example, followed by a combined code sketch.
1. Field type replacement replaces the parameter value of a field with a value that does not conform to the field type, so as to verify whether the system can correctly handle a mismatch between the field and the type of its parameter value. Code in a distributed system is mostly written in JAVA, so the invention is explained taking the JAVA data types as an example:
When the original parameter type is one of the JAVA data types, the original parameter value is replaced with values of each of the other seven types, so that abnormal parameters are generated through data replacement. For example:
(parameter name a, int(4)), when replaced with an abnormal parameter, becomes:
(parameter name a, char type), for example (parameter name a, B);
(parameter name a, String type), for example (parameter name a, BBBB); and so on.
2. Field length replacement means that, for a field with a limited length, a value exceeding the field length is assigned to the field, to verify whether the system handles such an over-length value correctly. For example, (parameter name a, int(4)) becomes (parameter name a, 11111) when replaced with an abnormal parameter, thereby verifying that the system can process the case where the parameter value of the field exceeds the defined field length.
3. Special field replacement refers to the assignment of special fields, i.e. field types whose values must obey certain logic rules. Such special field types are replaced with parameters that do not conform to the special field rules, to verify whether the system can handle them normally.
Examples are identity card numbers, telephone numbers, zip codes and dates, whose assigned values are replaced with parameters that do not comply with the corresponding rules. For example, a date is typically year-month-day and is assigned in code in the form YYYY-MM-DD; replacing the "-" symbol with another character, e.g. YYYYFMMFDD, produces a value that violates the rule and generates an abnormal test case of date type, and so on.
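The three replacement modes can be sketched together as follows. This is an illustrative JAVA fragment, not the claimed implementation; the concrete abnormal values chosen (a string where an int is expected, one extra digit, a broken date separator) simply mirror the examples above.

    import java.util.ArrayList;
    import java.util.List;

    public final class AbnormalDataGenerator {

        // Generates abnormal values for one field using the three replacement modes;
        // the concrete replacement values are illustrative only.
        public static List<String> abnormalValues(String name, String type, String value) {
            List<String> cases = new ArrayList<>();
            // 1. field type replacement: a value of a different type than declared
            if (type.startsWith("int")) {
                cases.add("BBBB");                    // String where an int is expected
            } else {
                cases.add("9999999999");              // numeric where another type is expected
            }
            // 2. field length replacement: exceed the declared length, e.g. int(4) -> 5 digits
            cases.add("1".repeat(value.length() + 1));
            // 3. special field replacement: break a known rule, e.g. the date separator
            if (type.equalsIgnoreCase("Date")) {
                cases.add(value.replace("-", "F"));   // YYYY-MM-DD -> YYYYFMMFDD
            }
            return cases;
        }

        public static void main(String[] args) {
            System.out.println(abnormalValues("a", "int(4)", "1234"));            // [BBBB, 11111]
            System.out.println(abnormalValues("openDate", "Date", "2021-05-31")); // includes 2021F05F31
        }
    }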
Fourth, the above step 103, i.e., step 4 in fig. 2, is introduced.
In specific implementation, the collected server equipment is combined with a fault expert library to generate the fault types related to the corresponding fault points, and a fault test case is formed.
In one embodiment, generating a fault type related to a corresponding fault point according to server device data and a pre-established fault expert database to form a fault test case may include:
obtaining the type of the related server equipment and the IP of the server equipment from the data of the server equipment;
and generating the fault types related to the corresponding fault points by combining a pre-established fault expert library according to the server equipment types and the server equipment IP to form fault test cases.
In specific implementation, the related server device type, server device IP and the like are obtained from the server device data, and a fault test case is then generated according to the server device type and the server device IP in combination with the fault expert library. The fault expert database is an expert database built from production data and experience, and its entries are mainly expressed as: device type | fault type | fault occurrence probability. The device types mainly include MYSQL, ORACLE, docker container, linux server, F5, SLB, DBLE, NOS cache server, Redis cache server, KAFKA, MQ and other device types related to the distributed system. The fault types mainly include CPU, memory, unavailable network port, network delay, network packet loss, network packet damage, network packet disorder, network packet retransmission, full disk space, busy disk IO, low disk IO rate and the like. The fault occurrence probability indicates how likely a certain fault is to occur on a certain type of server device; the higher the probability, the greater the importance. The probability is assigned mainly by comprehensively considering the probability of occurrence in production and the probability of occurrence in the test environment. When fault test cases are generated, fault test cases of different importance levels can be selected by setting a minimum threshold on the fault occurrence probability. Examples are as follows:
the failure expert database (relationship between server device type, server device failure type, and failure occurrence probability) may include:
docker container | CPU full load | 75;
docker container | JVM memory overflow | 80;
docker container | network port unavailable | 60;
docker container | network delay | 35;
docker container | network packet loss | 48;
docker container | network packet damaged | 30;
docker container | network packet out of order | 20;
docker container | network packet retransmission | 10;
docker container | disk space full | 20;
docker container | disk IO busy | 55;
docker container | disk IO rate low | 18;
NOS cache server | CPU full load | 75;
NOS cache server | JVM memory overflow | 80;
NOS cache server | network port unavailable | 60;
NOS cache server | network latency | 35;
NOS cache server | network packet loss | 48;
NOS cache server | network packet corruption | 30;
NOS cache server | network packet out of order | 20;
NOS cache server | network packet retransmission | 10;
NOS cache server | disk space full | 20;
NOS cache server | disk IO busy | 55;
NOS cache server | disk IO rate is low | 18.
The server device data collected by the embedded points may be (docker container, 122.18.xx.yy). If the minimum fault occurrence probability threshold set by the tester is 40, the fault test cases with a fault occurrence probability greater than or equal to 40 are generated in combination with the expert database (a code sketch of this filtering follows the generated set below). The generated fault test case set is as follows:
(docker container, 122.18.xx.yy, CPU full, 75);
(docker container, 122.18.xx.yy, JVM memory overflow, 80);
(docker container, 122.18.xx.yy, network port unavailable, 60);
(docker container, 122.18.xx.yy, network packet loss, 48);
(docker container, 122.18.xx.yy, disk IO busy, 55).
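The threshold filtering described above can be sketched as follows: expert-library entries for the collected device type whose fault occurrence probability is at or above the tester-defined threshold (40 in this example) are paired with the concrete server IP. The record and method names are illustrative, not part of the claimed implementation.

    import java.util.ArrayList;
    import java.util.List;

    public final class FaultCaseGenerator {

        // One entry of the fault expert library: device type | fault type | occurrence probability.
        public record ExpertEntry(String deviceType, String faultType, int probability) {}

        // One generated fault test case bound to a concrete server IP.
        public record FaultCase(String deviceType, String serverIp, String faultType, int probability) {}

        // Pairs a collected server device with every expert-library entry of the same
        // device type whose fault occurrence probability meets the minimum threshold.
        public static List<FaultCase> generate(String deviceType, String serverIp,
                                               List<ExpertEntry> expertLibrary, int threshold) {
            List<FaultCase> cases = new ArrayList<>();
            for (ExpertEntry e : expertLibrary) {
                if (e.deviceType().equals(deviceType) && e.probability() >= threshold) {
                    cases.add(new FaultCase(deviceType, serverIp, e.faultType(), e.probability()));
                }
            }
            return cases;
        }

        public static void main(String[] args) {
            List<ExpertEntry> library = List.of(
                    new ExpertEntry("docker container", "CPU full load", 75),
                    new ExpertEntry("docker container", "JVM memory overflow", 80),
                    new ExpertEntry("docker container", "network delay", 35));
            generate("docker container", "122.18.xx.yy", library, 40).forEach(System.out::println);
        }
    }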
Fifth, the above step 104, i.e., step 5 in fig. 2, is introduced.
In specific implementation, the corresponding test cases are scheduled automatically by a scheduling tool, the related monitoring data is collected, the test results are stored in a database, and abnormal test results are sent to the relevant testers.
Preferably, the scheduling covers two types of cases: scheduling of the abnormal data test cases and scheduling of the fault test cases. The abnormal data test cases are initiated primarily by JMeter or Postman. The fault test cases are initiated primarily by the ChaosBlade tool. The detailed steps for executing the abnormal data test cases and the fault test cases are described below.
1. First, the execution of the abnormal data test cases is introduced.
In one embodiment, executing the abnormal data test case and the fault test case to obtain the test result of the distributed system may include: executing the abnormal data test case according to the following method to obtain a test result corresponding to the abnormal data test case:
judging whether the number of the abnormal data test cases to be tested in the abnormal data test case set is greater than 0 or not;
when the number of the abnormal data test cases to be tested is more than 0, executing the abnormal data test cases through a Jmeter tool or a postman tool;
acquiring data when executing the case to obtain a test result corresponding to the abnormal data test case, and storing the test result corresponding to the abnormal data test case into a database;
and inquiring an abnormal test result from the database, and sending the abnormal test result to a tester.
In specific implementation, as shown in fig. 4, the execution process of the abnormal data test case provided in this embodiment specifically includes the following steps.
The abnormal data test cases are mainly initiated through JMeter or Postman; the processing flow is as follows (a code sketch of the driving loop follows step S304):
s301: and starting to execute the abnormal data test cases, and judging whether the number of the abnormal data test cases to be tested is more than 0. If the number of test cases of the abnormal data to be tested is greater than 0, step S302 is executed. If the number of test cases of the abnormal data to be tested is not greater than 0, step S304 is executed.
S302: the exception data test case is executed by a test tool such as Jmeter or Postman.
S303: and collecting test data and test results, and storing the test data and the test results in a database.
S304: and inquiring an abnormal test result from the database, and sending the abnormal test result to a tester.
2. Next, the execution of the fault test cases is described.
In one embodiment, executing the abnormal data test case and the fault test case to obtain the test result of the distributed system may include: executing the fault test case according to the following method to obtain a test result corresponding to the fault test case:
installing a chaosblade medium on a server to be injected with a fault;
judging whether the number of the fault test cases to be tested in the fault test case set is greater than 0;
when the number of the fault test cases to be tested is more than 0, initiating high concurrency of preset transactions through a LoadRunner tool or a Jmeter tool;
executing one fault test case in the fault test case set, and injecting a corresponding fault on a corresponding server IP through a ChaosBlade tool according to the fault type of the fault test case;
acquiring monitoring data to obtain a test result corresponding to the fault test case when the case is executed, and storing the test result into a database;
inquiring an abnormal test result from the database, and sending the abnormal test result to a tester;
and when the execution of all the fault test cases in the fault test case set is completed, the ChaosBlade medium is cancelled.
In one embodiment, the monitoring data may include: resource monitoring data and system monitoring data.
In specific implementation, as shown in fig. 5, the fault test case execution process provided in this embodiment specifically includes the following steps.
The fault test cases mainly inject faults through the ChaosBlade tool; the processing flow is as follows (a code sketch of one injection follows step S407):
S401: Installing a ChaosBlade medium on the server to be injected with the fault.
ChaosBlade is a fault-injection tool open-sourced by Alibaba in 2018; it can be used to inject CPU, memory, network, disk and other faults, and also supports secondary development and optimization as needed.
S402: and judging whether the number of the fault test cases to be tested is more than 0. If the number of the tested failure test cases is greater than 0, step S403 is executed. If the number of test cases to be tested is not greater than 0, step 406 is performed.
S403: high concurrency for a certain transaction is initiated through LoadRunner or meter.
Preferably, the number of concurrent users on production at peak hours for a certain transaction may be obtained first. The high concurrency of the step can take the number of concurrent users of a certain transaction in the production in the peak period as the number of concurrent users, and high concurrency pressure test can be initiated.
Preferably, LoadRunner and Jeter are tools that initiate pressure tests, where high concurrency of initiating transactions is primarily used to simulate in-transit transactions on production.
S404: and executing a fault test case, and injecting a corresponding fault through a ChaosBlade command according to the fault type of the fault test case.
The fault test cases generated in step 4 of fig. 2 are executed, and the corresponding fault is injected on the corresponding server IP through a ChaosBlade command according to the fault type in the case.
S405: and collecting monitoring data including resource monitoring data and system monitoring data, and storing the data in a database.
Preferably, the resource monitoring data includes the CPU, memory, network IO, disk IO and the like of the server, and may be acquired by the NMON tool or other resource acquisition tools.
Preferably, the system monitoring data includes transactions per second, transaction response time, number of concurrent users, number of successful transactions, number of failed transactions and the like, and can be collected through LoadRunner or JMeter.
Preferably, the collected data is stored in a database to facilitate querying. Since the service code may change with each version, each version is tested once, and the cleaning mechanism cleans the monitoring data in the database once for each version.
S406: and inquiring an abnormal test result from the database, and sending the abnormal test result to a tester.
S407: and completing the execution of all fault test cases, canceling the ChaosBlade medium, and restoring the test environment to the original state.
The distributed system chaotic engineering method provided by the embodiment of the invention has the advantages that:
1. The combination of abnormal data test cases and fault test cases is proposed for the first time as the test cases of chaos engineering. The test coverage of the embodiment of the invention is larger than the coverage obtained by using only abnormal data test cases or only fault test cases, so the various abnormal scenarios of chaos engineering can be covered effectively and comprehensively.
2. The transaction links, parameters, server devices, server types and the like involved in a transaction are obtained through embedding points; the test data is replaced with abnormal data according to certain rules, and fault test cases are generated according to the server device types. Test efficiency and test effect can thus be improved without additional manpower input, further improving the robustness and high availability of the distributed system.
The embodiment of the invention also provides a chaos engineering device of a distributed system, which is described in the following embodiment. Because the principle of the device for solving the problems is similar to that of the chaos engineering method of the distributed system, the implementation of the device can refer to the implementation of the chaos engineering method of the distributed system, and repeated parts are not described again.
Fig. 6 is a schematic structural diagram of a chaos engineering apparatus of a distributed system in an embodiment of the present invention, and as shown in fig. 6, the apparatus includes:
the acquisition unit 01 is used for acquiring test data of the distributed system and server equipment data through code embedded points;
the abnormal data test case generating unit 02 is used for replacing the test data with abnormal data to form an abnormal data test case;
the fault test case generating unit 03 is used for generating fault types related to corresponding fault points according to the server equipment data and a fault expert database established in advance to form fault test cases; the fault expert database is the relation among the server equipment type, the server equipment fault type and the fault occurrence probability;
and the test unit 04 is used for executing the abnormal data test case and the fault test case to obtain a test result of the distributed system.
In an embodiment, the abnormal data test case generating unit may be specifically configured to: and replacing the test data into abnormal data by field type replacement, field length replacement and special field replacement to form an abnormal data test case.
In one embodiment, the test unit may be specifically configured to: executing the fault test case according to the following method to obtain a test result corresponding to the fault test case:
installing a chaosblade medium on a server to be injected with a fault;
judging whether the number of the fault test cases to be tested in the fault test case set is greater than 0;
when the number of the fault test cases to be tested is more than 0, initiating high concurrency of preset transactions through a LoadRunner tool or a Jmeter tool;
executing one fault test case in the fault test case set, and injecting a corresponding fault on a corresponding server IP through a ChaosBlade tool according to the fault type of the fault test case;
acquiring monitoring data to obtain a test result corresponding to the fault test case when the case is executed, and storing the test result into a database;
inquiring an abnormal test result from the database, and sending the abnormal test result to a tester;
and when the execution of all the fault test cases in the fault test case set is completed, the ChaosBlade medium is cancelled.
In one embodiment, the monitoring data may include: resource monitoring data and system monitoring data.
In one embodiment, the test unit may be specifically configured to: executing the abnormal data test case according to the following method to obtain a test result corresponding to the abnormal data test case:
judging whether the number of the abnormal data test cases to be tested in the abnormal data test case set is greater than 0 or not;
when the number of the abnormal data test cases to be tested is more than 0, executing the abnormal data test cases through a Jmeter tool or a postman tool;
acquiring data when executing the case to obtain a test result corresponding to the abnormal data test case, and storing the test result corresponding to the abnormal data test case into a database;
and inquiring an abnormal test result from the database, and sending the abnormal test result to a tester.
In one embodiment, the failure test case generation unit may be specifically configured to:
obtaining the type of the related server equipment and the IP of the server equipment from the data of the server equipment;
and generating the fault types related to the corresponding fault points by combining a pre-established fault expert library according to the server equipment types and the server equipment IP to form fault test cases.
In an embodiment, the chaos engineering apparatus for a distributed system may further include: a preprocessing unit, used for preprocessing the test data and the server equipment data to obtain a mapping relation between the test data and the server equipment data.
In an embodiment, the chaos engineering apparatus for a distributed system may further include: a preliminary processing unit, used for performing primary processing on the test data to obtain the test data after the primary processing.
In one embodiment, the preliminary processing unit is specifically configured to: and splitting the test data message string to obtain the test data of the message string with the preset format.
The embodiment of the invention also provides computer equipment which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein the processor realizes the chaos engineering method of the distributed system when executing the computer program.
An embodiment of the present invention further provides a computer-readable storage medium, where a computer program for executing the chaotic engineering method of the distributed system is stored in the computer-readable storage medium.
In the embodiment of the invention, compared with existing chaos engineering schemes for distributed systems, in which test coverage is incomplete, the robustness and high availability of the system cannot be effectively improved, and test cases must be designed manually and then executed so that efficiency is low, the chaos engineering scheme of the distributed system: acquires test data of the distributed system and server equipment data through code embedding points; replaces the test data with abnormal data to form abnormal data test cases; generates fault types related to corresponding fault points according to the server equipment data and a pre-established fault expert library to form fault test cases, the fault expert library being the relation among the server equipment type, the server equipment fault type and the fault occurrence probability; and executes the abnormal data test cases and the fault test cases to obtain the test result of the distributed system, thereby comprehensively and efficiently improving the robustness and high availability of the distributed system.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A chaotic engineering method for a distributed system, comprising:
acquiring test data of a distributed system and server equipment data through code embedding points;
replacing the test data with abnormal data to form an abnormal data test case;
generating a fault type related to a corresponding fault point according to server equipment data and a pre-established fault expert library to form a fault test case; the fault expert database is the relation among the server equipment type, the server equipment fault type and the fault occurrence probability;
and executing the abnormal data test case and the fault test case to obtain the test result of the distributed system.
2. The chaotic engineering method for distributed systems according to claim 1, wherein the step of replacing the test data with anomalous data to form anomalous data test cases comprises: and replacing the test data into abnormal data by field type replacement, field length replacement and special field replacement to form an abnormal data test case.
3. The chaotic engineering method for distributed systems according to claim 1, wherein the step of executing the abnormal data test case and the failure test case to obtain the test result of the distributed system comprises: executing the fault test case according to the following method to obtain a test result corresponding to the fault test case:
installing a chaosblade medium on a server to be injected with a fault;
judging whether the number of the fault test cases to be tested in the fault test case set is greater than 0;
when the number of the fault test cases to be tested is more than 0, initiating high concurrency of preset transactions through a LoadRunner tool or a Jmeter tool;
executing one fault test case in the fault test case set, and injecting a corresponding fault on a corresponding server IP through a ChaosBlade tool according to the fault type of the fault test case;
acquiring monitoring data to obtain a test result corresponding to the fault test case when the case is executed, and storing the test result into a database;
inquiring an abnormal test result from the database, and sending the abnormal test result to a tester;
and when the execution of all the fault test cases in the fault test case set is completed, the ChaosBlade medium is cancelled.
4. The chaotic engineering method of a distributed system of claim 3, wherein the monitoring data comprises: resource monitoring data and system monitoring data.
5. The chaotic engineering method for distributed systems according to claim 1, wherein the step of executing the abnormal data test case and the failure test case to obtain the test result of the distributed system comprises: executing the abnormal data test case according to the following method to obtain a test result corresponding to the abnormal data test case:
judging whether the number of the abnormal data test cases to be tested in the abnormal data test case set is greater than 0 or not;
when the number of the abnormal data test cases to be tested is more than 0, executing the abnormal data test cases through a Jmeter tool or a postman tool;
acquiring data when executing the case to obtain a test result corresponding to the abnormal data test case, and storing the test result corresponding to the abnormal data test case into a database;
and inquiring an abnormal test result from the database, and sending the abnormal test result to a tester.
6. The chaotic engineering method of a distributed system according to claim 1, wherein a fault type related to a corresponding fault point is generated according to server device data and a fault expert library established in advance to form a fault test case, including:
obtaining the type of the related server equipment and the IP of the server equipment from the data of the server equipment;
and generating the fault types related to the corresponding fault points by combining a pre-established fault expert library according to the server equipment types and the server equipment IP to form fault test cases.
7. The chaotic engineering method for distributed systems of claim 1, further comprising: and preprocessing the test data and the server equipment data to obtain a mapping relation between the test data and the server equipment data.
8. A chaos engineering apparatus for a distributed system, comprising:
the acquisition unit is used for acquiring test data of the distributed system and server equipment data through code embedded points;
the abnormal data test case generating unit is used for replacing the test data with abnormal data to form an abnormal data test case;
the fault test case generation unit is used for generating fault types related to corresponding fault points according to the server equipment data and a pre-established fault expert database to form fault test cases; the fault expert database is the relation among the server equipment type, the server equipment fault type and the fault occurrence probability;
and the test unit is used for executing the abnormal data test case and the fault test case to obtain a test result of the distributed system.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program for executing the method of any one of claims 1 to 7.
CN202110603040.5A 2021-05-31 2021-05-31 Chaotic engineering method and device for distributed system Active CN113342650B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110603040.5A CN113342650B (en) 2021-05-31 2021-05-31 Chaotic engineering method and device for distributed system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110603040.5A CN113342650B (en) 2021-05-31 2021-05-31 Chaotic engineering method and device for distributed system

Publications (2)

Publication Number Publication Date
CN113342650A true CN113342650A (en) 2021-09-03
CN113342650B CN113342650B (en) 2024-07-02

Family

ID=77473279

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110603040.5A Active CN113342650B (en) 2021-05-31 2021-05-31 Chaotic engineering method and device for distributed system

Country Status (1)

Country Link
CN (1) CN113342650B (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080141224A1 (en) * 2006-12-01 2008-06-12 Shinichiro Kawasaki Debug information collection method and debug information collection system
CN107861867A (en) * 2017-10-24 2018-03-30 阿里巴巴集团控股有限公司 Page fault monitoring method, device, system and electronic equipment

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115271736A (en) * 2022-07-11 2022-11-01 中电金信软件有限公司 Method, device, equipment, storage medium and product for verifying transaction consistency

Also Published As

Publication number Publication date
CN113342650B (en) 2024-07-02

Similar Documents

Publication Publication Date Title
Xu et al. POD-Diagnosis: Error diagnosis of sporadic operations on cloud applications
CN112395177B (en) Interactive processing method, device, equipment and storage medium for business data
CN107818431B (en) Method and system for providing order track data
US20040153837A1 (en) Automated testing
CN112631846A (en) Fault drilling method and device, computer equipment and storage medium
CN107870948A (en) Method for scheduling task and device
CN109254912A (en) A kind of method and device of automatic test
CN111913824B (en) Method for determining data link fault cause and related equipment
US9823999B2 (en) Program lifecycle testing
CN114422386B (en) Monitoring method and device for micro-service gateway
Kesim et al. Identifying and prioritizing chaos experiments by using established risk analysis techniques
CN113342650B (en) Chaotic engineering method and device for distributed system
CN112202647A (en) Test method, device and test equipment in block chain network
CN113032281B (en) Method and device for acquiring code coverage rate in real time
CN112506802B (en) Test data management method and system
CN113672452A (en) Method and system for monitoring operation of data acquisition task
CN111722917A (en) Resource scheduling method, device and equipment for performance test task
JP2014035595A (en) Testing device for communication system, testing program for communication system, and testing method for communication system
CN110609761B (en) Method and device for determining fault source, storage medium and electronic equipment
CN111651346B (en) Method and device for testing front-end component, storage medium and computer equipment
CN113656210A (en) Processing method and device for error reporting information, server and readable storage medium
CN112230897A (en) Monitoring method and device for bank branch interface reconstruction
CN112965793A (en) Data warehouse task scheduling method and system oriented to identification analysis data
CN110489208A (en) Virtual machine configuration parameter check method, system, computer equipment and storage medium
Chen et al. Big data system testing method based on chaos engineering

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant